Troubleshooting Helios

How to Re-Deploy Helios from GitHub Actions 

To trigger a re-deployment of Helios without accessing the server manually, follow these steps using GitHub Actions:

Open the Helios GitHub Repository
Go to the Actions Tab
Select the Deployment Workflow
- Choose one of the deployment workflows:
  - Build and Deploy to Prod
  - Build and Deploy to Staging
Locate the “Deploy” step within the job summary.
Rerun the “Deploy” job
- It will bring down the current services with docker compose down.
- A new .env file is generated dynamically using GitHub Environment secrets and variables (e.g., POSTGRES_PASSWORD, NATS_URL, etc.).
- The services are brought back up using docker compose up -d.
- Finally, docker restart nginx is executed to reload routing and pick up any changes (e.g., environment updates or renewed TLS certificates).
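
For orientation, the steps the job performs can be approximated by hand as in the sketch below. This is only an illustration under assumptions: the function names write_env_file and redeploy are hypothetical, the variable list is limited to the two examples named above, and the real workflow may pass additional files or flags.

```shell
# Hypothetical sketch of the "Deploy" job's steps; the real workflow may differ.

# Write NAME=value lines to a file for each named environment variable.
write_env_file() {
  target=$1
  shift
  : > "$target"
  for name in "$@"; do
    # Indirect lookup: expands to the value of the variable called $name.
    eval "value=\${$name-}"
    printf '%s=%s\n' "$name" "$value" >> "$target"
  done
}

# Same order as the job: stop, regenerate .env, start, restart nginx.
redeploy() {
  docker compose down &&
  write_env_file .env POSTGRES_PASSWORD NATS_URL &&
  docker compose up -d &&
  docker restart nginx
}
```

Nothing runs until redeploy is called. In the actual job, the values come from GitHub Environment secrets and variables rather than from the local shell.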

Updating SSL/TLS Certificates 

Note

The Helios Docker Compose file includes all services (application-server, webhook-listener, client, database, NATS, etc.) except nginx. Nginx runs as a separate container attached to the same network (helios-network):

docker run -d \
  --name nginx \
  --restart unless-stopped \
  -p 80:80 -p 443:443 \
  -v /etc/nginx/conf/nginx.conf:/etc/nginx/nginx.conf:ro \
  -v /var/lib/rbg-cert:/var/lib/rbg-cert:ro \
  --net helios-network \
  nginx:latest

Once Nginx is running on the helios-network, it will proxy traffic to the Helios services defined in Docker Compose.

The SSL/TLS certificates are provided by the TUM IT department and are valid for one year. They are automatically renewed by ITG and exposed via symlinks under /var/lib/rbg-cert/live/. Nginx is configured to use these paths directly in both staging and production environments. For more details and the relevant nginx.conf certificate paths, refer to the Production Setup Guide -> Additional Containers -> nginx.

Warning

The only required action is to restart the nginx container once a year when the certificates are renewed:

docker restart nginx
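
To judge whether a restart is due, it can help to compare the expiry date of the renewed certificate on disk with the one nginx is currently serving. The sketch below assumes openssl is installed; cert_enddate is a hypothetical helper, and CERT_FILE and HELIOS_HOST are placeholders for values not given here.

```shell
# Print the expiry line (notAfter=...) of a PEM certificate read from stdin.
cert_enddate() {
  openssl x509 -noout -enddate
}

# Examples (CERT_FILE and HELIOS_HOST are placeholders):
#   cert_enddate < /var/lib/rbg-cert/live/CERT_FILE
#   openssl s_client -connect HELIOS_HOST:443 -servername HELIOS_HOST \
#     </dev/null 2>/dev/null | cert_enddate
```

If the served certificate reports an earlier date than the file on disk, nginx is most likely still holding the old certificate and needs the restart.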

NATS Webhook Data Cleanup 

Helios uses NATS to buffer incoming webhook events. These messages are stored persistently in a Docker volume named helios_nats-data. Over time, old webhook data may accumulate and can be safely removed if it is no longer needed.

List Docker Volumes

To see the NATS data volume, run:
```
docker volume ls
```
Look for a volume named helios_nats-data.
Stop Helios

Before deleting the NATS data volume, stop the entire Helios stack to avoid data corruption:
```
cd /opt/helios
docker compose -f compose.prod.yaml down --remove-orphans --rmi all
```
Remove the NATS Data Volume

To delete the NATS data volume, run:
```
docker volume rm helios_nats-data
```
This command will remove the persistent storage for NATS, effectively clearing all buffered webhook data.

Restart Helios

After cleaning up the NATS data, you can restart the Helios stack:

cd /opt/helios
docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
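
Because the volume removal is destructive, the sequence above can be wrapped in a small script with a dry-run switch and rehearsed first. This is a sketch only: run, nats_cleanup and the DRY_RUN variable are hypothetical names, while the commands themselves are the ones listed in this section.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Stop the stack, drop the NATS volume, start the stack again.
# The body runs in a subshell so the cd does not leak into the caller.
nats_cleanup() (
  run cd /opt/helios &&
  run docker compose -f compose.prod.yaml down --remove-orphans --rmi all &&
  run docker volume rm helios_nats-data &&
  run docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
)
```

Calling nats_cleanup with DRY_RUN=1 set prints the four commands without touching Docker. Without it, the chain stops at the first command that fails.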

If Helios Stops Processing Webhook Events 

In some cases, the Helios application-server may stop handling incoming GitHub webhook events. This is often caused by the NATS consumer being deleted due to inactivity. Below is a brief overview of the issue, initial diagnostic steps, and remediation.

Issue Overview 

The application-server subscribes to a NATS stream (“github”) to receive webhook messages produced by the webhook-listener.
If the NATS consumer is inactive (does not pull new messages or acknowledge an event) for longer than the inactive threshold (NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES, 30 minutes by default), NATS may delete the consumer automatically.

When the consumer is deleted, you will see errors such as:

application-server-1  | 2025-02-07T15:51:06.956Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl  : pullStatusError, Connection: 13, Subscription: 615556999, Consumer Name: xy23djLX, Status:Status{code=409, message='Consumer Deleted'}
application-server-1  | 2025-02-07T15:53:59.905Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl  : heartbeatAlarm, Connection: 13, Subscription: 94524147, Consumer Name: xy23djLX, lastStreamSequence: 58810, lastConsumerSequence: 40771

After this error, no new webhook events will be processed until the consumer is re-created.

Initial Diagnostic Steps 

Verify GitHub App Webhooks

Go to the GitHub App settings page: https://github.com/organizations/ls1intum/settings/apps/helios-aet/advanced

Under Advanced, check that webhooks are being delivered successfully.

If webhook deliveries are failing, inspect the HTTP response codes and payloads.

Failed deliveries may indicate issues with the webhook-listener.

Check Logs for Event Receipt

webhook-listener Logs: Confirm that the webhook-listener container is receiving GitHub events. You should see log entries like the one below:
INFO:   Published message to github.ls1intum.Artemis.workflow_run: PubAck(stream='github', seq=433413, domain=None, duplicate=None)
application-server Logs: Verify whether those same events appear in the application-server logs, for example with:
```
docker logs -f application-server
```
If the listener shows events but the application-server does not, that indicates a NATS delivery issue.

Search for Consumer Deleted Errors

In the application-server logs, search for Consumer Deleted, pullStatusError or heartbeatAlarm to confirm that the NATS consumer was removed. This is a clear sign that the server cannot receive new messages.
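
The search can be done with a short filter like the one below. It is a sketch: count_consumer_errors is a hypothetical helper, and the service name application-server in the usage line is assumed from the log prefix shown earlier.

```shell
# Count log lines on stdin that point to a deleted NATS consumer.
# Prints 0 when nothing matches.
count_consumer_errors() {
  grep -E -c 'Consumer Deleted|pullStatusError|heartbeatAlarm'
}

# Example (run in /opt/helios):
#   docker compose -f compose.prod.yaml logs application-server 2>&1 | count_consumer_errors
```

Any result above zero is a reason to look at the matching lines themselves before moving on to remediation.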

Remediation Steps 

If you have confirmed that the application-server is not receiving events due to a deleted consumer, perform the following steps to restore processing:

Stop Helios

cd /opt/helios
docker compose -f compose.prod.yaml down

Remove the NATS Data Volume
This will clear the NATS stream and allow the consumer to be recreated:
docker volume rm helios_nats-data

Restart Helios

cd /opt/helios
docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d

This sequence will clear any stale NATS state and allow a fresh consumer to be created.

For more details, refer to the issue and PR that introduced durable consumer support:

Disaster Recovery 

In the event that the Helios database volume is accidentally removed, there is no direct backup available. To recover:

Edit the .env File

Before starting Helios, open the .env file (located in /opt/helios) and set:
```
DATA_SYNC_RUN_ON_STARTUP=true
```
This ensures that Helios will run the full data synchronization process on startup.
Start Helios for Data Synchronization

Because DATA_SYNC_RUN_ON_STARTUP is true, Helios will repopulate repository metadata and configuration from GitHub. The sync can take 30–40 minutes, depending on the number and size of repositories.
```
cd /opt/helios
docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
```
Reconfigure Repository Settings

After synchronization completes, each repository admin must log in to the Helios UI and reapply repository-specific settings:
- Repository settings: Workflow UI grouping
- Repository settings: Test artifact labeling
- Environment settings: Setting up environment URLs, labels, etc.
This manual step typically takes 5 minutes per repository.
Disable Startup Sync

Once all repositories are reconfigured, return to the .env file and set:
```
DATA_SYNC_RUN_ON_STARTUP=false
```
This prevents redundant full syncs on subsequent restarts.
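
Switching the flag on before the recovery and off afterwards can be scripted instead of edited by hand. The helper below is a minimal sketch: set_env_var is a hypothetical name, and it assumes simple values such as true or false (no quoting, and no | or & characters).

```shell
# Set KEY=VALUE in an env file: replace the existing line or append a new one.
set_env_var() {
  key=$1
  value=$2
  file=$3
  if grep -q "^${key}=" "$file"; then
    # Go through a temp file instead of relying on non-portable `sed -i`.
    tmp=$(mktemp) &&
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" &&
    cat "$tmp" > "$file" &&
    rm -f "$tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example:
#   set_env_var DATA_SYNC_RUN_ON_STARTUP true /opt/helios/.env
```

Writing back with cat keeps the original file's owner and permissions, which matters because .env holds credentials.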

Connecting to the Database via SSH Tunnel 

To connect from your local machine to the PostgreSQL database running on the Helios host, you can establish an SSH tunnel. First, ensure that your SSH config defines a host alias (e.g., “helios”) pointing to the server. Then run:

ssh -N -L 5433:localhost:5432 helios

-N: Do not execute a remote command; only forward ports.
-L 5433:localhost:5432: Forward local port 5433 to remote port 5432 on the Helios host.
helios: SSH host alias (or replace with username@hostname if no alias is defined).

Once the tunnel is established, connect locally to port 5433 as if you were connecting directly to the database:

psql -h localhost -p 5433 -U <db_user> -d <db_name>

Replace <db_user> and <db_name> with the appropriate PostgreSQL username and database name. You can view the actual credentials (username, password, database name) in the .env file under /opt/helios.
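
The lookup of those values can be scripted. The sketch below reads a key from env-file text. env_value is a hypothetical helper, and the key names POSTGRES_USER and POSTGRES_DB are assumptions that should be checked against the actual .env file.

```shell
# Print the value of KEY from env-file text on stdin
# (first match only, no handling of quoted values).
env_value() {
  sed -n "s/^$1=//p" | head -n 1
}

# Example (needs server access and an open tunnel on port 5433):
#   env_text=$(ssh helios 'cat /opt/helios/.env')
#   db_user=$(printf '%s\n' "$env_text" | env_value POSTGRES_USER)
#   db_name=$(printf '%s\n' "$env_text" | env_value POSTGRES_DB)
#   psql -h localhost -p 5433 -U "$db_user" -d "$db_name"
```

The password is deliberately not passed on the command line; psql prompts for it.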

Deployment User 

Helios deployments on GitHub Actions are performed by a dedicated user account named github_deployment. This user was created following the instructions in the ls1intum GitHub organization’s repository (see https://github.com/ls1intum/.github/). GitHub Actions uses github_deployment to push updated images and configuration into the Helios environment, so ensure that:

The github_deployment user has the correct SSH keys and permissions configured on the Helios host.
Any workflow secrets referencing deployment keys are up to date.
The .ssh directory of github_deployment (e.g., /home/github_deployment/.ssh/) contains the authorized public key in authorized_keys; the matching private key belongs in the workflow secrets, not on the host.
The github_deployment user is a member of the docker group so it can run Docker commands during deployment.
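
The group membership can be verified on the host with a short check like the one below. It is a sketch; user_in_group is a hypothetical helper built on the standard id command.

```shell
# Succeed if USER belongs to GROUP (primary or supplementary).
user_in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Example:
#   user_in_group github_deployment docker || echo "github_deployment is not in the docker group"
```

A failed check also covers the case where the user does not exist, since id then prints nothing.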

Useful Environment Variables 

Helios relies on a set of environment variables defined in the .env file (located in /opt/helios) to configure certain runtime behaviors and image tags. Common variables include:

# Enable or disable sending emails (true/false)
EMAIL_ENABLED=true

# Image tags for various Helios services
CLIENT_IMAGE_TAG=latest
APPLICATION_SERVER_IMAGE_TAG=latest
NOTIFICATION_SERVER_IMAGE_TAG=latest
WEBHOOK_LISTENER_IMAGE_TAG=latest
KEYCLOAK_IMAGE_TAG=latest

# Run data synchronization on startup (true/false)
DATA_SYNC_RUN_ON_STARTUP=false

# Time (in minutes) after which an inactive consumer is removed
NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES=30
# Time (in seconds) that NATS waits for a message ACK before resending
NATS_CONSUMER_ACK_WAIT_SECONDS=60

EMAIL_ENABLED: Controls whether the Helios application sends notification emails. Set to false to disable email functionality.
Image Tag Variables:
- latest refers to the most recent release.
- For the latest staging build, set to staging.
- For specific staging builds, use the format: sha-<short-sha>-staging
DATA_SYNC_RUN_ON_STARTUP: If set to true, Helios will run the data synchronization process each time it starts up. Set to false to disable automatic sync on boot.
NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES: Defines how long (in minutes) a NATS consumer can be inactive before it is automatically removed. Default is 30 minutes.
NATS_CONSUMER_ACK_WAIT_SECONDS: Specifies the time (in seconds) that NATS waits for a message acknowledgment before resending it. Default is 60 seconds.
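
To pin a service to a specific staging build, the tag can be derived from a commit SHA. The sketch below assumes the short SHA is the first 7 characters of the commit hash, which should be confirmed against the published image tags. staging_tag is a hypothetical helper.

```shell
# Build a tag of the form sha-<short-sha>-staging from a commit SHA.
# Assumption: the short SHA is the first 7 characters.
staging_tag() {
  short=$(printf '%s' "$1" | cut -c1-7)
  printf 'sha-%s-staging\n' "$short"
}

# Example:
#   staging_tag "$(git rev-parse HEAD)"
```

The resulting value goes into the relevant *_IMAGE_TAG variable in .env before the stack is restarted.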

Useful Server Commands 

Running Helios

To start the entire Helios application stack in detached mode and always pull the latest images:
```
cd /opt/helios
docker compose -f compose.prod.yaml --env-file=.env up --pull=always -d
```
Stopping Helios

To stop and remove all Helios containers, networks, and images (including orphaned containers):
```
cd /opt/helios
docker compose -f compose.prod.yaml down --remove-orphans --rmi all
```
Viewing All Container Logs

To follow logs for every container defined in the Compose file:
```
cd /opt/helios
docker compose -f compose.prod.yaml logs -f
```
Viewing an Individual Container’s Logs

List running containers to identify the container name:
```
docker ps
```
Then, view logs for a specific container (replace container_name with the actual name):
```
docker logs -f container_name
```

Inspecting Disk and Volume Usage

To view Docker’s disk usage, including local volumes and their sizes:

docker system df -v

Example output:

Local Volumes space usage:
VOLUME NAME        LINKS     SIZE
helios_nats-data   1         5.91GB
helios_db-data     1         14.87GB
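
To pull out the size of a single volume, for instance to decide whether a NATS cleanup is worthwhile, the output can be filtered. The sketch below uses a hypothetical helper, volume_size, and assumes the column layout shown in the example output.

```shell
# Print the SIZE column for a named volume from `docker system df -v` output on stdin.
volume_size() {
  awk -v name="$1" '$1 == name { print $NF }'
}

# Example:
#   docker system df -v | volume_size helios_nats-data
```

The helper prints nothing when the volume is not listed.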