AtlasML Troubleshooting
This guide covers common production issues encountered by AtlasML administrators and how to resolve them.
Service Health Issues
Container Won't Start
Symptom:
docker ps
# atlasml is not listed
Diagnosis:
# Check logs
docker logs atlasml
# Check exit code
docker inspect atlasml | jq '.[0].State.ExitCode'
# View last run time
docker inspect atlasml | jq '.[0].State.StartedAt'
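The exit code from the diagnosis above often narrows the cause before you read any logs. The following is a minimal sketch that maps common codes to likely causes; `explain_exit_code` is a hypothetical helper, and the categories are general Docker and POSIX signal conventions rather than anything AtlasML-specific.

```sh
# Map a container exit code to a likely cause.
# Hypothetical helper; codes above 128 conventionally mean "128 + signal number".
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "clean exit: the process finished without error" ;;
    125) echo "docker run itself failed (for example, an invalid flag)" ;;
    126) echo "container command found but not executable" ;;
    127) echo "container command not found" ;;
    137) echo "killed by SIGKILL: often the OOM killer or 'docker kill'" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM: usually 'docker stop'" ;;
    *)
      # Non-numeric input makes the test fail and falls through to the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application error: check 'docker logs atlasml'"
      fi
      ;;
  esac
}

# Example:
#   explain_exit_code "$(docker inspect atlasml | jq '.[0].State.ExitCode')"
```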
Common Causes & Solutions:
1. Missing Environment Variables
Error in logs:
KeyError: 'ATLAS_API_KEYS'
Solution:
# Check .env file exists
ls -la /opt/atlasml/.env
# Verify required variables
grep -E "^(ATLAS_API_KEYS|WEAVIATE_HOST|WEAVIATE_PORT)=" /opt/atlasml/.env
# If missing, add them
nano /opt/atlasml/.env
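The variable check can also be scripted so that it reports exactly which names are absent. This is a sketch under the assumption that the file uses plain NAME=value lines; `missing_env_vars` is a hypothetical helper, not part of AtlasML.

```sh
# Print each required variable that is absent or empty in an env file.
# Returns 0 when all are present, 1 otherwise.
# Usage: missing_env_vars FILE VAR [VAR...]
missing_env_vars() {
  file="$1"
  shift
  status=0
  for name in "$@"; do
    # Match "NAME=<at least one character>" at the start of a line.
    if ! grep -Eq "^${name}=.+" "$file"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   missing_env_vars /opt/atlasml/.env ATLAS_API_KEYS WEAVIATE_HOST WEAVIATE_PORT
```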
2. Weaviate Not Running
Error in logs:
WeaviateConnectionError: Could not connect to Weaviate
Solution:
# Check if Weaviate is accessible
curl -H "Authorization: Bearer YOUR_WEAVIATE_API_KEY" https://your-weaviate-domain.com/v1/.well-known/ready
# Should return: {"status":"ok"}
# If Weaviate is not accessible, check the centralized Weaviate service
# (Weaviate runs on a separate server - see /weaviate directory)
# Restart AtlasML
docker-compose -f docker-compose.prod.yml restart atlasml
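If Weaviate is still coming up, restarting AtlasML too early reproduces the same error. One option is to poll the readiness endpoint first. The sketch below is a generic retry loop; `wait_until` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```sh
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Do not sleep after the final failed attempt.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: wait up to 10 x 3 seconds for Weaviate, then restart AtlasML.
#   wait_until 10 3 curl -fsS -H "Authorization: Bearer YOUR_WEAVIATE_API_KEY" \
#     https://your-weaviate-domain.com/v1/.well-known/ready \
#     && docker-compose -f docker-compose.prod.yml restart atlasml
```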
3. Port Already in Use
Error in logs:
ERROR: bind: address already in use: 0.0.0.0:80
Solution:
# Find process using port 80
sudo lsof -i :80
# or
sudo netstat -tulpn | grep :80
# Stop conflicting service
sudo systemctl stop nginx # or apache2
# Or change AtlasML port in compose file
# Edit docker-compose.prod.yml:
# ports:
#   - '8080:8000' # Use 8080 instead of 80
4. Image Pull Failed
Error:
Error response from daemon: pull access denied for ghcr.io/ls1intum/edutelligence/atlasml
Solution:
# Login to GitHub Container Registry
echo "$GITHUB_TOKEN" | docker login ghcr.io -u USERNAME --password-stdin
# Pull image manually
docker pull ghcr.io/ls1intum/edutelligence/atlasml:main
# Restart
docker-compose -f docker-compose.prod.yml up -d
Container Starts But Unhealthy
Symptom:
docker ps
# STATUS: Up 2 minutes (unhealthy)
Diagnosis:
# Check health status
docker inspect atlasml | jq '.[0].State.Health'
# View health check logs
docker inspect atlasml | jq '.[0].State.Health.Log'
# Test health endpoint manually
curl http://localhost/api/v1/health
Solutions:
1. Health Check Timeout
If the health check takes longer than 10 seconds, increase the timeout:
# docker-compose.prod.yml
healthcheck:
  timeout: 30s # Increase from 10s
2. Application Not Ready
If the application is slow to start, increase start_period:
healthcheck:
  start_period: 30s # Increase from 10s
3. Weaviate Connectivity
Test from container:
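# Use sh -c with single quotes so the variables expand inside the container, not on the host
docker exec atlasml sh -c 'curl http://${WEAVIATE_HOST}:${WEAVIATE_PORT}/v1/.well-known/ready'
If this fails, check network connectivity and Weaviate status.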
Connection Issues
Weaviate Connection Failed
Symptom:
WeaviateConnectionError: Could not connect to Weaviate at https://your-weaviate-domain.com
Diagnosis:
# 1. Check if Weaviate is accessible
curl -H "Authorization: Bearer YOUR_WEAVIATE_API_KEY" https://your-weaviate-domain.com/v1/.well-known/ready
# 2. Check Weaviate service status on Weaviate server
# SSH to the Weaviate server and check:
docker ps | grep weaviate
docker logs weaviate
# 3. Test from AtlasML container (sh -c so the variables expand inside the container)
docker exec atlasml sh -c 'curl -H "Authorization: Bearer ${WEAVIATE_API_KEY}" "${WEAVIATE_HOST}/v1/.well-known/ready"'
# 4. Check network
docker network inspect shared-network
Solutions:
If Weaviate Not Accessible
# Check DNS resolution
nslookup your-weaviate-domain.com
# Check if Weaviate server is reachable
ping your-weaviate-domain.com
# Verify Weaviate API key is correct in .env
cat /opt/atlasml/.env | grep WEAVIATE_API_KEY
# Restart AtlasML with updated configuration
docker-compose -f docker-compose.prod.yml restart
If Weaviate Server Down
SSH to the Weaviate server and check the service:
# Check Weaviate status
cd /path/to/edutelligence/weaviate
docker-compose ps
# View Weaviate logs
docker-compose logs weaviate
# Restart if needed
docker-compose restart weaviate
# Verify it's accessible
curl -H "Authorization: Bearer YOUR_API_KEY" https://your-weaviate-domain.com/v1/.well-known/ready
If Host Resolution Issue
# Check WEAVIATE_HOST value
docker exec atlasml printenv WEAVIATE_HOST
# If using service name, ensure on same Docker network
# If using localhost from container, use host.docker.internal
# Update .env
WEAVIATE_HOST=host.docker.internal
OpenAI API Connection Failed
Symptom:
OpenAI API Error: Authentication failed
Diagnosis:
# 1. Check API key is set
docker exec atlasml printenv OPENAI_API_KEY
# 2. Test API directly (both variables must be set in the current shell)
# OPENAI_API_URL already includes the https:// scheme
curl "${OPENAI_API_URL}/openai/deployments" \
  -H "api-key: ${OPENAI_API_KEY}"
Solutions:
Invalid API Key
# Verify key in Azure Portal
# Azure Portal → Azure OpenAI → Keys and Endpoint
# Update .env
OPENAI_API_KEY=correct-key-from-azure
# Restart
docker-compose -f docker-compose.prod.yml restart atlasml
Wrong URL
# Verify endpoint in Azure Portal
# Should be: https://{resource-name}.openai.azure.com
# Update .env
OPENAI_API_URL=https://correct-resource.openai.azure.com
# Restart
docker-compose -f docker-compose.prod.yml restart atlasml
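A quick shape check catches the most common endpoint mistakes (missing scheme, doubled scheme, stray path) before a restart. This is only a sketch: `is_azure_openai_endpoint` is a hypothetical helper, and the assumption that resource names contain only letters, digits, and hyphens is mine.

```sh
# Succeed only if the value looks like https://{resource-name}.openai.azure.com
# with an optional trailing slash and no path.
# Assumption: resource names consist of letters, digits, and hyphens.
is_azure_openai_endpoint() {
  printf '%s\n' "$1" | grep -Eq '^https://[A-Za-z0-9-]+\.openai\.azure\.com/?$'
}

# Example:
#   is_azure_openai_endpoint "$(docker exec atlasml printenv OPENAI_API_URL)" \
#     || echo "OPENAI_API_URL does not look like an Azure OpenAI endpoint"
```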
Network/Firewall Block
# Test connectivity from server
curl -I https://your-resource.openai.azure.com
# If blocked, configure firewall to allow HTTPS to *.openai.azure.com
API Errors
401 Unauthorized
Symptom:
curl http://localhost/api/v1/competency/suggest
# {"detail":"Invalid API key"}
Diagnosis:
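# 1. Check API keys configured in the running container
docker exec atlasml printenv ATLAS_API_KEYS
# 2. Verify format in the .env file (the variable is not set in the host shell)
grep '^ATLAS_API_KEYS=' /opt/atlasml/.env
# Should be: key1,key2 (comma-separated, no brackets, quotes, or spaces)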
Solutions:
Missing Authorization Header
# ❌ Bad - No header
curl http://localhost/api/v1/competency/suggest
# ✅ Good - With Authorization header
curl -H "Authorization: your-api-key" http://localhost/api/v1/competency/suggest
Wrong Key Format in .env
# ❌ Bad - Incorrect formats
ATLAS_API_KEYS=["key1","key2"] # JSON array (not supported)
ATLAS_API_KEYS=[key1,key2] # Brackets
ATLAS_API_KEYS="key1, key2" # Spaces around commas
# ✅ Good - Comma-separated
ATLAS_API_KEYS=key1,key2
# Fix and restart
docker-compose -f docker-compose.prod.yml restart atlasml
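The format rules above are easy to check mechanically. The following sketch encodes them in a hypothetical `validate_api_keys` helper; the error messages are illustrative and do not come from AtlasML.

```sh
# Check that a value is a plain comma-separated list such as key1,key2.
# Prints the first problem found and returns 1, or prints "ok" and returns 0.
validate_api_keys() {
  value="$1"
  case "$value" in
    "")          echo "empty value"; return 1 ;;
    *\[*|*\]*)   echo "brackets are not supported"; return 1 ;;
    *\"*|*\'*)   echo "quotes found in value"; return 1 ;;
    *" "*)       echo "spaces are not allowed"; return 1 ;;
    ,*|*,|*,,*)  echo "empty key in list"; return 1 ;;
  esac
  echo "ok"
}

# Example:
#   validate_api_keys "$(docker exec atlasml printenv ATLAS_API_KEYS)"
```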
Key Mismatch
# Verify Artemis is using correct key
# Check Artemis configuration: application-prod.yml
# atlas.atlasml.api-key should match one of ATLAS_API_KEYS
# Update either AtlasML or Artemis to match
422 Unprocessable Entity
Symptom:
{
"detail": [
{
"loc": ["body", "course_id"],
"msg": "field required",
"type": "value_error.missing"
}
]
}
Cause: Request body doesn't match expected schema
Solution:
# View API documentation in a browser
open http://localhost/docs # macOS; use xdg-open on Linux
# Fix request to include all required fields
curl -X POST http://localhost/api/v1/competency/suggest \
-H "Authorization: your-key" \
-H "Content-Type: application/json" \
-d '{
"description": "Python programming",
"course_id": 1
}'
500 Internal Server Error
Symptom:
{"detail":"Internal server error"}
Diagnosis:
# Check application logs
docker logs atlasml --tail 100
# Look for stack traces
docker logs atlasml 2>&1 | grep -A 20 "ERROR"
# Check Sentry (if configured)
# Visit Sentry dashboard for detailed error info
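When the log is long, ranking the ERROR lines by frequency shows which failure dominates. The sketch below is a plain text pipeline; `top_errors` is a hypothetical helper and assumes that error lines contain the literal string ERROR, as in the grep above.

```sh
# Read log lines on stdin and print the most frequent ERROR lines,
# most common first. Timestamps are not stripped, so identical messages
# with different timestamps are counted separately.
# Usage: top_errors [LIMIT]   (default 10)
top_errors() {
  limit="${1:-10}"
  grep 'ERROR' | sort | uniq -c | sort -rn | head -n "$limit"
}

# Example:
#   docker logs atlasml 2>&1 | top_errors 5
```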
Common Causes:
Database Error
# Check Weaviate connectivity
docker exec atlasml sh -c 'curl http://${WEAVIATE_HOST}:${WEAVIATE_PORT}/v1/.well-known/ready'
# Check Weaviate logs
docker logs weaviate
OpenAI API Error
# Check OpenAI quota
# Azure Portal → Azure OpenAI → Usage
# Check rate limits
# If exceeded, wait or upgrade plan
Memory Issue
# Check memory usage
docker stats atlasml
# If near limit, increase memory
Performance Issues
Slow Response Times
Symptom: Requests take >5 seconds
Diagnosis:
# Measure response time
time curl -X POST http://localhost/api/v1/competency/suggest \
-H "Authorization: test" \
-H "Content-Type: application/json" \
-d '{"description":"test","course_id":1}'
# Check resource usage
docker stats atlasml
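A single timed request can be misleading, so it may help to average several samples. This is a sketch: `average` and `sample_latency` are hypothetical helpers, and the URL, key, and request body are the placeholders from the command above.

```sh
# Average a column of numbers (one per line) read from stdin.
average() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.3f\n", sum / n }'
}

# Send N POST requests and print each total time in seconds, one per line.
# curl's -w '%{time_total}' reports the full request duration.
sample_latency() {
  n="$1"
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' \
      -X POST http://localhost/api/v1/competency/suggest \
      -H "Authorization: test" \
      -H "Content-Type: application/json" \
      -d '{"description":"test","course_id":1}'
    i=$((i + 1))
  done
}

# Example: average over 10 requests
#   sample_latency 10 | average
```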
Solutions:
High CPU Usage
# Check CPU
docker stats atlasml --no-stream
# If consistently >80%, scale up:
# Option 1: Increase CPU limit
# docker-compose.prod.yml:
# deploy:
#   resources:
#     limits:
#       cpus: '4.0' # Increase from 2.0
# Option 2: Scale horizontally (multiple instances)
# Note: scaling fails if the service sets a fixed container_name or publishes a fixed host port
docker-compose -f docker-compose.prod.yml up -d --scale atlasml=3
High Memory Usage
# Check memory
docker stats atlasml --no-stream
# If near limit, increase memory
# docker-compose.prod.yml:
# deploy:
#   resources:
#     limits:
#       memory: 4G # Increase from 2G
Large Weaviate Collection
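# Inspect the collections defined in the schema
# (/v1/schema lists classes and properties; it does not report object counts)
docker exec weaviate curl http://localhost:8080/v1/schema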
# If very large (>100k objects), consider:
# 1. Archiving old data
# 2. Filtering queries by course_id
# 3. Optimizing Weaviate configuration
OpenAI API Latency
# Benchmark OpenAI API (OPENAI_API_URL already includes the https:// scheme)
time curl "${OPENAI_API_URL}/openai/deployments/..." \
  -H "api-key: ${OPENAI_API_KEY}"
# If slow, check:
# 1. API region (use closest region)
# 2. Rate limiting
# 3. Consider local embeddings for non-critical queries
High Memory Usage
Symptom:
docker stats atlasml
# MEM USAGE: 1.8GB / 2GB (90%)
Diagnosis:
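# Take a single snapshot
docker stats atlasml --no-stream
# Monitor over time (streams until Ctrl+C)
docker stats atlasml
# Check for memory leaks: if usage grows continuously under steady load,
# compare snapshots taken about 1 hour apart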
Solutions:
Increase Memory Limit
# docker-compose.prod.yml
deploy:
  resources:
    limits:
      memory: 4G # Increase from 2G
Optimize Application
# Check if caching too much data
# Review recent code changes
# Look for memory leaks in logs
# Restart to clear memory (temporary fix)
docker-compose -f docker-compose.prod.yml restart atlasml
Add Memory Monitoring
# Set up alert when memory >80%
# See Monitoring Guide for details
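Until proper monitoring is in place, a small check run from cron can approximate the 80% alert. The sketch below is an assumption-laden stopgap, not the Monitoring Guide's method: `mem_above` and `check_atlasml_memory` are hypothetical helpers, and the warning is only printed, not delivered anywhere.

```sh
# Succeed if a percentage such as "90.25%" exceeds the threshold.
# awk converts the leading numeric part of the string, so the "%" is ignored.
# Usage: mem_above "90.25%" 80
mem_above() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# Take one sample for the atlasml container and warn if it is above 80%.
check_atlasml_memory() {
  pct="$(docker stats atlasml --no-stream --format '{{.MemPerc}}')"
  if mem_above "$pct" 80; then
    echo "WARNING: atlasml memory at $pct"
  fi
}

# Example crontab entry (assumes the functions are saved in a script that calls
# check_atlasml_memory; the script path is hypothetical):
#   */5 * * * * /opt/atlasml/check-memory.sh
```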
Deployment Issues
New Version Not Deploying
Symptom: Container running but still old version
Diagnosis:
# Check image tag
docker inspect atlasml | jq '.[0].Config.Image'
# Check when the container was created
docker inspect atlasml | jq '.[0].Created'
# Check available images
docker images | grep atlasml
Solutions:
Force Pull New Image
# Pull latest
docker-compose -f docker-compose.prod.yml pull
# Stop and remove container
docker-compose -f docker-compose.prod.yml down
# Start with new image
docker-compose -f docker-compose.prod.yml up -d
# Verify new version
docker logs atlasml 2>&1 | grep "Started"
Clear Image Cache
# Remove old images
docker rmi ghcr.io/ls1intum/edutelligence/atlasml:old-tag
# Pull specific version
IMAGE_TAG=v1.2.0 docker-compose -f docker-compose.prod.yml pull
# Restart
IMAGE_TAG=v1.2.0 docker-compose -f docker-compose.prod.yml up -d
Deployment Rollback Needed
Scenario: New version has critical bug
Quick Rollback:
cd /opt/atlasml
# Set previous version (drop any existing IMAGE_TAG line first so only one definition remains)
grep -v '^IMAGE_TAG=' .env > .env.temp
echo "IMAGE_TAG=v1.1.0" >> .env.temp
mv .env.temp .env
# Pull and restart
docker-compose -f docker-compose.prod.yml pull
docker-compose -f docker-compose.prod.yml up -d
# Verify
docker logs atlasml
curl http://localhost/api/v1/health
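The tag switch can be wrapped in a small function so that rollbacks are repeatable and never leave two IMAGE_TAG lines in the file. This is a sketch; `set_image_tag` is a hypothetical helper, and it assumes the tag is defined as a plain IMAGE_TAG=value line.

```sh
# Set IMAGE_TAG in an env file, replacing any existing definition.
# Note: the file is rewritten through a temporary file, so check its
# permissions afterwards if it holds secrets.
# Usage: set_image_tag FILE TAG
set_image_tag() {
  file="$1"
  tag="$2"
  tmp="${file}.tmp"
  # Keep every line except existing IMAGE_TAG definitions.
  # grep exits non-zero when nothing is left, which is not an error here.
  grep -v '^IMAGE_TAG=' "$file" > "$tmp" || true
  echo "IMAGE_TAG=${tag}" >> "$tmp"
  mv "$tmp" "$file"
}

# Example:
#   set_image_tag /opt/atlasml/.env v1.1.0
```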
Data Issues
Weaviate Data Lost
Symptom: All competencies missing after restart
Diagnosis:
# Check volume exists
docker volume ls | grep weaviate
# Inspect volume
docker volume inspect weaviate-data
# Check mount point
docker inspect weaviate | jq '.[0].Mounts'
Solutions:
Volume Not Mounted
# compose.weaviate.yaml
# Ensure volume is configured:
services:
  weaviate:
    volumes:
      - weaviate-data:/var/lib/weaviate # Mount volume
volumes:
  weaviate-data: # Declare volume
Volume Deleted
# Check if backup exists
ls -lh /path/to/backups/weaviate-*.tar.gz
# Restore from backup
docker-compose -f compose.weaviate.yaml down
sudo tar -xzf /path/to/backups/weaviate-backup-20250115.tar.gz -C /
docker-compose -f compose.weaviate.yaml up -d
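When several backups exist, the newest one is usually the one to restore. Because the file names carry a YYYYMMDD stamp, they sort chronologically. The sketch below relies on that naming; `latest_backup` is a hypothetical helper.

```sh
# Print the newest weaviate-backup-YYYYMMDD.tar.gz in a directory.
# Glob results are sorted, and date-stamped names sort chronologically,
# so the last match is the newest. Returns 1 if no backup is found.
latest_backup() {
  dir="$1"
  newest=""
  for f in "$dir"/weaviate-backup-*.tar.gz; do
    # Skip the unexpanded pattern when nothing matches.
    [ -e "$f" ] || continue
    newest="$f"
  done
  [ -n "$newest" ] && echo "$newest"
}

# Example:
#   sudo tar -xzf "$(latest_backup /path/to/backups)" -C /
```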
Create Regular Backups
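#!/bin/bash
# Backup script: save as /opt/atlasml/backup-weaviate.sh
# Stops Weaviate, archives the data volume, then starts Weaviate again.
# Adjust the directory below to the one containing compose.weaviate.yaml
cd /path/to/edutelligence/weaviate || exit 1
docker-compose -f compose.weaviate.yaml stop
sudo tar -czf /path/to/backups/weaviate-backup-$(date +%Y%m%d).tar.gz \
  /var/lib/docker/volumes/weaviate-data
docker-compose -f compose.weaviate.yaml start
# Crontab entry (separate from the script): run daily at 02:00
0 2 * * * /opt/atlasml/backup-weaviate.sh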
Network Issues
AtlasML Not Reachable from Artemis
Symptom: Artemis cannot connect to AtlasML
Diagnosis:
# Test from Artemis server
curl http://atlasml-server:80/api/v1/health
# Check firewall
sudo ufw status
Solutions:
Firewall Blocking
# Allow from Artemis server
sudo ufw allow from ARTEMIS_IP to any port 80
# Or allow all (less secure)
sudo ufw allow 80/tcp
# Verify
sudo ufw status numbered
Wrong Hostname/IP in Artemis
# application-prod.yml in Artemis
atlas:
  atlasml:
    base-url: https://correct-hostname-or-ip # Fix this
    api-key: your-api-key
DNS Issue
# Test DNS resolution from Artemis
nslookup atlasml-server.company.com
# If it fails, add an entry to /etc/hosts
echo "192.168.1.100 atlasml-server" | sudo tee -a /etc/hosts
SSL/TLS Certificate Issues
Symptom: HTTPS connection fails
Diagnosis:
# Test SSL
curl -v https://atlasml.company.com/api/v1/health
# Check certificate
openssl s_client -connect atlasml.company.com:443 -servername atlasml.company.com
Solutions:
Certificate Expired
# Renew Let's Encrypt certificate
sudo certbot renew
# Reload Nginx
sudo systemctl reload nginx
Self-Signed Certificate
# If using self-signed, Artemis must trust it
# Copy cert to Artemis server
scp /etc/ssl/certs/atlasml.crt artemis-server:/usr/local/share/ca-certificates/
ssh artemis-server 'sudo update-ca-certificates'
Troubleshooting Checklist
When encountering an issue:
- Check service health
docker ps
docker logs atlasml --tail 50
- Test connectivity
curl http://localhost/api/v1/health
curl http://localhost:8085/v1/.well-known/ready
- Verify configuration
docker exec atlasml env | grep -E "(WEAVIATE|OPENAI|ATLAS)"
- Check resources
docker stats atlasml --no-stream
df -h
- Review logs
docker logs atlasml 2>&1 | grep ERROR
docker logs weaviate 2>&1 | grep ERROR
- Check Sentry (if configured)
  - Visit Sentry dashboard
  - Filter by environment and time
Getting Help
Gather Information
Before reporting issues, collect:
# System info
uname -a
docker --version
docker-compose --version
# Container status
docker ps -a | grep atlasml
# Recent logs
docker logs atlasml --tail 100 > atlasml-logs.txt
# Configuration (mask secrets!)
docker exec atlasml env | grep -E "(WEAVIATE|ATLAS|ENV)" | sed 's/=.*/=***/' > config.txt
# Resource usage
docker stats atlasml --no-stream > stats.txt
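Masking every value hides non-sensitive settings that are often useful in a report. An alternative is to mask only variables whose names look like secrets. This is a sketch: `mask_secrets` is a hypothetical helper, the name patterns (KEY, TOKEN, SECRET, PASSWORD) are assumptions, and you should still review the output before sharing it, since hostnames may also be sensitive.

```sh
# Replace the value of secret-looking variables with "***" and leave
# all other NAME=value lines unchanged.
# Assumption: secrets have KEY, TOKEN, SECRET, or PASSWORD in their name.
mask_secrets() {
  sed -E 's/^([A-Za-z0-9_]*(KEY|TOKEN|SECRET|PASSWORD)[A-Za-z0-9_]*)=.*/\1=***/'
}

# Example:
#   docker exec atlasml env | grep -E "(WEAVIATE|OPENAI|ATLAS|ENV)" | mask_secrets > config.txt
```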
Report Issue
Include in report:
- Description: What were you trying to do?
- Expected: What should happen?
- Actual: What actually happened?
- Environment:
- OS and version
- Docker version
- AtlasML version/image tag
- Logs: Relevant error messages
- Steps to reproduce: How to recreate the issue
Common Error Messages
| Error | Cause | Solution |
|---|---|---|
| WeaviateConnectionError | Weaviate not running | Start Weaviate |
| 401 Unauthorized | Invalid/missing API key | Check Authorization header |
| 422 Unprocessable Entity | Invalid request body | Check request schema |
| 500 Internal Server Error | Server-side error | Check logs |
| Connection refused | Service not running | Start service |
| Address already in use | Port conflict | Change port or stop other service |
| No such container | Container not running | Start container |
| pull access denied | Image not accessible | Check credentials |
Next Steps
- Installation: Reinstall if needed
- Configuration: Verify configuration
- Monitoring: Set up monitoring to catch issues early
- Deployment: Review deployment process
Resources
- Docker Troubleshooting: https://docs.docker.com/config/daemon/troubleshoot/
- Weaviate Troubleshooting: https://weaviate.io/developers/weaviate/installation/troubleshooting
- FastAPI Documentation: https://fastapi.tiangolo.com/
- GitHub Issues: https://github.com/ls1intum/edutelligence/issues