Troubleshooting Helios
How to Re-Deploy Helios from GitHub Actions
To trigger a re-deployment of Helios without accessing the server manually, follow these steps using GitHub Actions:
- Open the Helios GitHub Repository 
- Go to the Actions Tab 
- Select the Deployment Workflow: choose one of the deployment workflows:
  - Build and Deploy to Prod
  - Build and Deploy to Staging
- Locate the “Deploy” step within the job summary.
- Rerun the “Deploy” job. It performs the following:
  - The current services are brought down with docker compose down.
  - A new .env file is generated dynamically using GitHub Environment secrets and variables (e.g., POSTGRES_PASSWORD, NATS_URL).
  - The services are brought back up using docker compose up -d.
  - Finally, docker restart nginx is executed to reload routing and pick up any changes (e.g., environment updates or renewed TLS certificates).
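The .env generation step can be pictured with a small shell sketch. This is an illustration only, not the actual workflow code: the function name write_env_file and the way variable names are passed in are assumptions, and only POSTGRES_PASSWORD and NATS_URL are taken from the description above.

```shell
# Hypothetical sketch: write KEY=value lines for the named variables,
# reading each value from the (exported) environment.
# Usage: write_env_file .env POSTGRES_PASSWORD NATS_URL
write_env_file() {
  local file="$1" name value
  shift
  : > "$file" || return 1
  for name in "$@"; do
    # printenv fails when the variable is not set, so a missing
    # secret aborts the run instead of producing an empty entry.
    value="$(printenv "$name")" || { echo "missing variable: $name" >&2; return 1; }
    printf '%s=%s\n' "$name" "$value" >> "$file"
  done
}

# Surrounding sequence, as described in the steps above:
#   docker compose down
#   write_env_file .env POSTGRES_PASSWORD NATS_URL
#   docker compose up -d
#   docker restart nginx
```

Values are written verbatim, so this sketch does not handle values that span multiple lines.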
 
Updating SSL/TLS Certificates
Note
The Helios Docker Compose file includes all services (application-server, webhook-listener, client, database, NATS, etc.) except nginx. Nginx runs as a separate container attached to the same network (helios-network):
docker run -d \
  --name nginx \
  --restart unless-stopped \
  -p 80:80 -p 443:443 \
  -v /etc/nginx/conf/nginx.conf:/etc/nginx/nginx.conf:ro \
  -v /var/lib/rbg-cert:/var/lib/rbg-cert:ro \
  --net helios-network \
  nginx:latest
Once Nginx is running on the helios-network, it will proxy traffic to the Helios services defined in Docker Compose.
The SSL/TLS certificates are provided by the TUM IT department and are valid for 1 year. They are automatically renewed by ITG and exposed via symlinks under /var/lib/rbg-cert/live/. Nginx is configured to use these paths directly in both staging and production environments. For more details and the relevant nginx.conf certificate paths, refer to the Production Setup Guide -> Additional Containers -> nginx.
Warning
The only required action is to restart the nginx container once a year when the certificates are renewed:
docker restart nginx
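One way to tell whether the restart is still pending is to compare the certificate nginx currently serves with the file on disk. The sketch below assumes openssl is installed. The function names are hypothetical, and the host name and certificate file are passed in as arguments because the exact file name under /var/lib/rbg-cert/live/ is not given here.

```shell
# Print the SHA-256 fingerprint of the first PEM certificate read from stdin.
cert_fingerprint() {
  openssl x509 -noout -fingerprint -sha256
}

# Fingerprint of the certificate currently served on HOST:443.
served_fingerprint() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | cert_fingerprint
}

# Succeeds when the served certificate differs from the one on disk,
# i.e. nginx has probably not picked up the renewed file yet.
# Note: a failed connection also counts as a mismatch.
# Usage: nginx_cert_is_stale <host> <path-to-certificate.pem>
nginx_cert_is_stale() {
  local host="$1" cert_file="$2"
  [ "$(served_fingerprint "$host" 2>/dev/null)" != "$(cert_fingerprint < "$cert_file")" ]
}
```

If nginx_cert_is_stale succeeds for the production or staging host, running docker restart nginx should bring the served certificate in line with the renewed file.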
NATS Webhook Data Cleanup
Helios uses NATS to buffer incoming webhook events. These messages are stored persistently in a Docker volume named helios_nats-data. Over time, old webhook data may accumulate and can be safely removed if it is no longer needed.
- List Docker Volumes: to see the NATS data volume, run:
    docker volume ls
  Look for a volume named helios_nats-data.
- Stop Helios: before deleting the NATS data volume, stop the entire Helios stack to avoid data corruption:
    cd /opt/helios
    docker compose -f compose.prod.yaml down --remove-orphans --rmi all
- Remove the NATS Data Volume: to delete the NATS data volume, run:
    docker volume rm helios_nats-data
  This command removes the persistent storage for NATS, effectively clearing all buffered webhook data.
- Restart Helios: after cleaning up the NATS data, restart the Helios stack:
    cd /opt/helios
    docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
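The stop, remove, and restart steps above can be combined into a single script. The following is a minimal sketch rather than an official Helios tool: the run helper, the DRY_RUN switch, and the HELIOS_DIR override are assumptions added for illustration, while the docker commands are the ones listed above.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then "$@"; fi
}

# Stop the stack, delete the NATS volume, and start the stack again.
# HELIOS_DIR is a hypothetical override; it defaults to /opt/helios.
reset_nats_volume() {
  cd "${HELIOS_DIR:-/opt/helios}" || return 1
  run docker compose -f compose.prod.yaml down --remove-orphans --rmi all || return 1
  run docker volume rm helios_nats-data || return 1
  run docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
}

# Preview the commands without touching anything:
#   DRY_RUN=1 reset_nats_volume
```

Each step aborts the sequence on failure, so the stack is not restarted if the volume could not be removed.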
If Helios Stops Processing Webhook Events
In some cases, the Helios application-server may stop handling incoming GitHub webhook events. This is often caused by the NATS consumer being deleted due to inactivity. Below is a brief overview of the issue, initial diagnostic steps, and remediation.
Issue Overview
- The application-server subscribes to a NATS stream (“github”) to receive webhook messages produced by the webhook-listener. 
- If the NATS consumer is inactive (does not pull new messages or acknowledge an event) for a certain period, NATS may delete the consumer automatically. 
- When the consumer is deleted, you will see errors such as:
    application-server-1 | 2025-02-07T15:51:06.956Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl : pullStatusError, Connection: 13, Subscription: 615556999, Consumer Name: xy23djLX, Status:Status{code=409, message='Consumer Deleted'}
    application-server-1 | 2025-02-07T15:53:59.905Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl : heartbeatAlarm, Connection: 13, Subscription: 94524147, Consumer Name: xy23djLX, lastStreamSequence: 58810, lastConsumerSequence: 40771
- After this error, no new webhook events will be processed until the consumer is re-created. 
Initial Diagnostic Steps
- Verify GitHub App Webhooks
  - Go to the GitHub App settings page: https://github.com/organizations/ls1intum/settings/apps/helios-aet/advanced
  - Under Advanced, check that webhooks are being delivered successfully.
  - If webhook deliveries are failing, inspect the HTTP response codes and payloads. Failed deliveries may indicate issues with the webhook-listener.
- Check Logs for Event Receipt
  - webhook-listener logs: confirm that the webhook-listener container is receiving GitHub events. You should see log entries like the following:
      INFO: Published message to github.ls1intum.Helios.workflow_run: PubAck(stream='github', seq=433413, domain=None, duplicate=None)
  - application-server logs: verify whether those same events appear in the application-server logs. If the listener shows events but the application-server does not, that indicates a NATS delivery issue. To follow the logs, run:
      docker logs -f application-server
- Search for Consumer Deleted Errors
  - In the application-server logs, search for Consumer Deleted, pullStatusError, or heartbeatAlarm to confirm that the NATS consumer was removed. This is a clear sign that the server cannot receive new messages.
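The log search can be scripted with grep. This is a small sketch with hypothetical function names; the three search terms are the ones named above, and the container name in the usage comment should be checked against the output of docker ps.

```shell
# Read log lines on stdin and print those that point to a deleted
# NATS consumer.
find_consumer_errors() {
  grep -E 'Consumer Deleted|pullStatusError|heartbeatAlarm'
}

# Succeeds (exit status 0) when at least one such line is present.
consumer_was_deleted() {
  find_consumer_errors >/dev/null
}

# Usage (docker logs replays the container's stderr on stderr,
# hence the 2>&1):
#   docker logs application-server 2>&1 | find_consumer_errors
```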
Remediation Steps
If you have confirmed that the application-server is not receiving events due to a deleted consumer, perform the following steps to restore processing:
- Stop Helios:
    cd /opt/helios
    docker compose -f compose.prod.yaml down
- Remove the NATS Data Volume. This clears the NATS stream and allows the consumer to be recreated:
    docker volume rm helios_nats-data
- Restart Helios:
    cd /opt/helios
    docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
This sequence will clear any stale NATS state and allow a fresh consumer to be created.
For more details, refer to the issue and PR that introduced durable consumer support:
Disaster Recovery
In the event that the Helios database volume is accidentally removed, there is no direct backup available. To recover:
- Edit the .env File: before starting Helios, open the .env file (located in /opt/helios) and set:
    DATA_SYNC_RUN_ON_STARTUP=true
  This ensures that Helios runs the full data synchronization process on startup.
- Start Helios for Data Synchronization: because DATA_SYNC_RUN_ON_STARTUP is true, Helios will repopulate repository metadata and configuration from GitHub. The sync can take up to 30–40 minutes, depending on the number and size of repositories.
    cd /opt/helios
    docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
- Reconfigure Repository Settings: after synchronization completes, each repository admin must log in to the Helios UI and reapply repository-specific settings:
  - Repository settings: Workflow UI grouping
  - Repository settings: Test artifact labeling
  - Environment settings: Setting up environment URLs, labels, etc.
  This manual step typically takes 5 minutes per repository.
- Disable Startup Sync: once all repositories are reconfigured, return to the .env file and set:
    DATA_SYNC_RUN_ON_STARTUP=false
  This prevents redundant full syncs on subsequent restarts.
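Toggling DATA_SYNC_RUN_ON_STARTUP by hand is easy to forget, so the edit can also be scripted. The helper below is a hypothetical sketch (set_env_var is not part of Helios). It rewrites a single KEY=value line in place and leaves the rest of the file untouched.

```shell
# Set KEY=VALUE in an env file. An existing line for KEY is replaced
# where it stands; if the key is absent, a new line is appended.
# Usage: set_env_var /opt/helios/.env DATA_SYNC_RUN_ON_STARTUP true
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)" || return 1
  awk -v k="$key" -v v="$value" '
    BEGIN { FS = "="; found = 0 }
    $1 == k { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Overwrite the contents rather than moving the temp file, so the
  # permissions of the original file are kept.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run it with true before the recovery start and with false once all repositories are reconfigured.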
Connecting to the Database via SSH Tunnel
To connect from your local machine to the PostgreSQL database running on the Helios host, you can establish an SSH tunnel. First, ensure that your SSH config defines a host alias (e.g., “helios”) pointing to the server. Then run:
ssh -N -L 5433:localhost:5432 helios
- -N: Do not execute a remote command; just forward ports.
- -L 5433:localhost:5432: Forward local port 5433 to remote port 5432 on the Helios host.
- helios: SSH host alias (or replace with username@hostname if no alias is defined).
Once the tunnel is established, connect locally to port 5433 as if you were connecting directly to the database:
psql -h localhost -p 5433 -U <db_user> -d <db_name>
Replace <db_user> and <db_name> with the appropriate PostgreSQL username and database name. You can view the actual credentials (username, password, database name) in the .env file under /opt/helios.
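To look up the values for the psql command without opening the file in an editor, a small helper can read single keys from the .env file. This is a sketch: env_get is a hypothetical name, and the keys POSTGRES_USER and POSTGRES_DB in the usage comment are assumptions, so check the actual key names in /opt/helios/.env.

```shell
# Print the value of KEY from an env file (last assignment wins).
# Everything after the first "=" is treated as the value.
env_get() {
  local file="$1" key="$2"
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Usage on the server (key names are assumptions):
#   env_get /opt/helios/.env POSTGRES_USER
#   env_get /opt/helios/.env POSTGRES_DB
```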
Deployment User
Helios deployments on GitHub Actions are performed by a dedicated user account named github_deployment. This user was created following the instructions in the ls1intum GitHub organization’s repository (see https://github.com/ls1intum/.github/). GitHub Actions uses github_deployment to push updated images and configuration into the Helios environment, so ensure that:
- The github_deployment user has the correct SSH keys and permissions configured on the Helios host.
- Any workflow secrets referencing deployment keys are up to date.
- The home directory for github_deployment (e.g., /home/github_deployment/.ssh/) contains the authorized public key in its authorized_keys file; the matching private key belongs in the workflow secrets, not on the server.
- The github_deployment user is a member of the docker group so it can run Docker commands during deployment.
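The group membership from the last point can be verified on the server with a short check. The function name user_in_group is hypothetical; it relies only on the standard id command.

```shell
# Succeeds when USER is a member of GROUP.
user_in_group() {
  local user="$1" group="$2"
  # id -nG prints the user's group names separated by spaces;
  # grep -Fx matches the group name as a whole line.
  id -nG "$user" | tr ' ' '\n' | grep -Fxq -- "$group"
}

# Usage:
#   user_in_group github_deployment docker && echo "ok" || echo "not in docker group"
```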
Useful Environment Variables
Helios relies on a set of environment variables defined in the .env file (located in /opt/helios) to configure certain runtime behaviors and image tags. Common variables include:
# Enable or disable sending emails (true/false)
EMAIL_ENABLED=true
# Image tags for various Helios services
CLIENT_IMAGE_TAG=latest
APPLICATION_SERVER_IMAGE_TAG=latest
NOTIFICATION_SERVER_IMAGE_TAG=latest
WEBHOOK_LISTENER_IMAGE_TAG=latest
KEYCLOAK_IMAGE_TAG=latest
# Run data synchronization on startup (true/false)
DATA_SYNC_RUN_ON_STARTUP=false
# Time (in minutes) after which an inactive consumer is removed
NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES=30
# Time (in seconds) that NATS waits for a message ACK before resending
NATS_CONSUMER_ACK_WAIT_SECONDS=60
- EMAIL_ENABLED: Controls whether the Helios application sends notification emails. Set to false to disable email functionality.
- Image Tag Variables:
  - latest refers to the most recent release.
  - For the latest staging build, set to staging.
  - For specific staging builds, use the format: sha-<short-sha>-staging
- DATA_SYNC_RUN_ON_STARTUP: If set to true, Helios will run the data synchronization process each time it starts up. Set to false to disable automatic sync on boot.
- NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES: Defines how long (in minutes) a NATS consumer can be inactive before it is automatically removed. Default is 30 minutes.
- NATS_CONSUMER_ACK_WAIT_SECONDS: Specifies the time (in seconds) that NATS waits for a message acknowledgment before resending it. Default is 60 seconds.
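As an illustration of the sha-<short-sha>-staging format, the helper below derives such a tag from a full commit SHA. It is a sketch under one stated assumption: the short SHA is taken to be the first 7 characters, which should be confirmed against the tags actually published for Helios. The function name staging_tag is hypothetical.

```shell
# Build a staging image tag from a commit SHA.
# Assumption: the short SHA is the first 7 characters of the commit SHA.
staging_tag() {
  local sha="$1"
  printf 'sha-%s-staging\n' "$(printf '%s' "$sha" | cut -c1-7)"
}

# Usage, e.g. to pin one service in /opt/helios/.env:
#   APPLICATION_SERVER_IMAGE_TAG=$(staging_tag <full-commit-sha>)
```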
Useful Server Commands
- Running Helios: to start the entire Helios application stack in detached mode and always pull the latest images:
    cd /opt/helios
    docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
- Stopping Helios: to stop and remove all Helios containers, networks, and images (including orphaned containers):
    cd /opt/helios
    docker compose -f compose.prod.yaml down --remove-orphans --rmi all
- Viewing All Container Logs: to follow logs for every container defined in the Compose file:
    cd /opt/helios
    docker compose -f compose.prod.yaml logs -f
- Viewing an Individual Container’s Logs: list running containers to identify the container name:
    docker ps
  Then view logs for a specific container (replace container_name with the actual name):
    docker logs -f container_name
- Inspecting Disk and Volume Usage: to view Docker’s disk usage, including local volumes and their sizes:
    docker system df -v
  Example output:
    Local Volumes space usage:
    VOLUME NAME        LINKS   SIZE
    helios_nats-data   1       5.91GB
    helios_db-data     1       14.87GB
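To pull a single volume's size out of that output, for example to decide whether the NATS data is worth cleaning up, the table can be filtered with awk. This is a minimal sketch with a hypothetical function name. It assumes the row layout shown in the example output (name, links, size).

```shell
# Read "docker system df -v" output on stdin and print the SIZE column
# of the row whose first column equals the given volume name.
volume_size() {
  local name="$1"
  awk -v n="$name" '$1 == n { print $3 }'
}

# Usage:
#   docker system df -v | volume_size helios_nats-data
```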