Hermes — self-healing supervisor stack

A multi-service stack that supervises ELC AI Agent instances. It was spawned from the feat/corone-self-healing-agent branch. The keb workspace was the reference implementation, which is why keb’s commits carry the author identity Julian (Hermes Self-Healing Port).

Components (3 systemd units)

Service            Port             Purpose                                                           Status (2026-05-03)
hermes-webui       127.0.0.1:8787   Python supervisor + Web UI (uvicorn)                              activating
hermes-dashboard   127.0.0.1:9119   Hermes admin dashboard (/root/.local/bin/hermes dashboard)        active
hermes-cloudflare  (tunnel)         Cloudflare Tunnel exposing hermes.corone.monster → 127.0.0.1:8787 activating
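
The status column above can be re-checked at any time. The following is a minimal sketch, not part of Hermes itself: it assumes the three unit names in the table are the actual systemd unit names and that `systemctl` is on the PATH. The helper names (`unit_states`, `stuck_units`) are made up for illustration.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

# Unit names taken from the components table; assumed to match the systemd units.
HERMES_UNITS = ("hermes-webui", "hermes-dashboard", "hermes-cloudflare")


def _systemctl_is_active(unit: str) -> str:
    """Return the state printed by `systemctl is-active`, e.g. 'active' or 'activating'."""
    # is-active exits non-zero for any state other than 'active', so do not raise on it.
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() or "unknown"


def unit_states(
    units: Iterable[str] = HERMES_UNITS,
    query: Callable[[str], str] = _systemctl_is_active,
) -> Dict[str, str]:
    """Map each unit name to its current state string."""
    return {unit: query(unit) for unit in units}


def stuck_units(states: Dict[str, str]) -> List[str]:
    """Return the units that are not fully up (any state other than 'active')."""
    return sorted(unit for unit, state in states.items() if state != "active")
```

Calling `stuck_units(unit_states())` on the host would list every unit that is still activating or has failed. The `query` parameter exists only so the logic can be exercised without a running systemd.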

Public endpoint

  • https://hermes.corone.monster (via Cloudflare Tunnel — see cloudflared config)
  • nginx vhost: /etc/nginx/sites-enabled/hermes.corone.monster

Code locations

  • Agent: /root/.hermes/hermes-agent/ (Python venv)
  • Web UI: /root/hermes-webui/server.py
  • Dashboard binary: /root/.local/bin/hermes
  • cloudflared: /root/.local/bin/cloudflared

Function

Watches ELC instances (corone-app:3000, kebahagiaan-app:3001):

  • Restarts them on crash via systemd
  • Captures the error.unexpected lifecycle event (added in 1.3.7)
  • Catches process-level uncaughtException and unhandledRejection
  • Dispatches notifications externally

Related commits:

  • e2a1365 feat(lifecycle) dispatch error.unexpected on uncaught exception/rejection
  • a6cba6a feat(notifications) error.unexpected event type for process-level crashes
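
How Hermes actually watches the two instances is not described here. As a rough illustration only, a liveness probe for the ports named above could look like the sketch below. It assumes both instances listen on the loopback address, and `is_listening` and `probe_all` are hypothetical helper names.

```python
import socket
from typing import Dict, Tuple

# Ports come from the text; the loopback host is an assumption.
ELC_INSTANCES: Dict[str, Tuple[str, int]] = {
    "corone-app": ("127.0.0.1", 3000),
    "kebahagiaan-app": ("127.0.0.1", 3001),
}


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_all(
    instances: Dict[str, Tuple[str, int]] = ELC_INSTANCES,
    timeout: float = 2.0,
) -> Dict[str, bool]:
    """Probe every instance and map its name to whether the port accepted a connection."""
    return {
        name: is_listening(host, port, timeout)
        for name, (host, port) in instances.items()
    }
```

A TCP connect only shows that something is listening on the port. It says nothing about whether the agent behind it is healthy, which is why the lifecycle events matter as a second signal.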

Why it matters

Without Hermes, an ELC crash results in a 502 until someone restarts the instance manually. With Hermes:

  1. Process dies → systemd starts a new one
  2. Lifecycle event is flagged → Hermes sends a notification
  3. Dashboard at :9119 shows recent restart history
  4. Cloudflare Tunnel keeps the URL alive even if local nginx hiccups
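
Step 1 can be observed from outside the process. systemd keeps a per-service `NRestarts` counter of automatic restarts, readable with `systemctl show`. The sketch below reads that counter. It is not how the dashboard builds its history (that is not documented here), and unit names such as `corone-app` are assumptions.

```python
import subprocess


def parse_nrestarts(show_output: str) -> int:
    """Extract the integer from a 'NRestarts=<n>' line of `systemctl show` output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "NRestarts":
            return int(value.strip())
    raise ValueError("NRestarts not found in systemctl output")


def restart_count(unit: str) -> int:
    """Ask systemd how many times it has automatically restarted the unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nrestarts(result.stdout)


def restarted_since(previous: int, current: int) -> bool:
    """True if the counter grew between two polls, i.e. at least one restart happened."""
    return current > previous
```

Polling `restart_count("corone-app")` (hypothetical unit name) at an interval and comparing consecutive values with `restarted_since` is enough to detect that a crash and restart took place.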

Cross-refs