CalypsoAI Monitoring & Metrics
For On-Prem deployments, customers are responsible for monitoring the deployed solution themselves. To support this, CalypsoAI exposes a wide set of internal metrics that can be leveraged for observability and troubleshooting:
1. Prometheus-Compatible Metrics for cai-scanner
- Metrics are exposed at the /metrics endpoint in Prometheus-compatible format.
- Common scraping tools such as Prometheus or Dynatrace can collect this data.
- Example metrics include:
- Number of running/waiting requests (vllm:num_requests_running, vllm:num_requests_waiting)
- GPU cache metrics and memory usage
- Python garbage collection and memory stats
- CPU and process metrics (process_cpu_seconds_total, process_resident_memory_bytes)
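As a quick illustration of what a scrape of the /metrics endpoint contains, the sketch below parses a small sample of Prometheus exposition-format text. The metric names match those listed above, but the sample values and the `model_name` label are placeholders; a real cai-scanner scrape will carry different labels and values.

```python
# Minimal sketch: parse Prometheus exposition-format text into a dict.
# The sample below is illustrative only; label sets on a real scrape differ.

SAMPLE_SCRAPE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="default"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="default"} 7.0
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1234.5
"""

def parse_metrics(text: str) -> dict[str, float]:
    """Map each metric name (labels stripped) to its last reported value."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop the label set, if any
        values[name] = float(value_part)
    return values

metrics = parse_metrics(SAMPLE_SCRAPE)
print(metrics["vllm:num_requests_running"])  # 3.0
print(metrics["vllm:num_requests_waiting"])  # 7.0
```

In practice you would not parse this by hand; Prometheus or Dynatrace does it for you. The example only shows the shape of the data your scraper will see.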
2. GPU Monitoring
- CalypsoAI recommends deploying the NVIDIA DCGM Exporter as a DaemonSet for GPU telemetry.
- This allows collection of detailed GPU performance data.
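A Prometheus scrape job for the DCGM Exporter DaemonSet might look like the sketch below. The `app: dcgm-exporter` pod label is an assumption about how the DaemonSet is labeled in your cluster; adjust the selector to match your deployment. Port 9400 is the DCGM Exporter's default metrics port.

```yaml
# Hypothetical scrape job for DCGM Exporter pods; adjust labels/namespaces
# to match your actual DaemonSet.
scrape_configs:
  - job_name: "dcgm-exporter"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only DCGM Exporter pods (label selector is an assumption).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: dcgm-exporter
        action: keep
      # DCGM Exporter serves metrics on port 9400 by default.
      - source_labels: [__address__]
        regex: "([^:]+)(?::\\d+)?"
        replacement: "$1:9400"
        target_label: __address__
```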
3. Moderator Component Telemetry
- Similar Prometheus-compatible metrics are available via the cai-moderator service.
- These include:
- Thread and DB connection availability
- Worker processing times
- General Python and process metrics
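Moderator metrics like these are natural inputs for alerting. The Prometheus alerting rule below is a sketch only: `cai_moderator_db_connections_available` is a placeholder name, not a confirmed metric, so substitute the actual metric name reported by your cai-moderator /metrics endpoint.

```yaml
groups:
  - name: cai-moderator
    rules:
      - alert: ModeratorDBConnectionsLow
        # Placeholder metric name -- replace with the real metric
        # exposed by your cai-moderator service.
        expr: cai_moderator_db_connections_available < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "cai-moderator is running low on database connections"
```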
4. Dashboarding and Visualization
- CalypsoAI can provide a pre-built Grafana dashboard (e.g., vllm-dashboard.json) to visualize these metrics effectively.
- Prometheus and Grafana can be set up using Helm charts, and we can guide customers through this process as needed.
5. Legacy Metrics (For Reference)
- The legacy /backend/v1/app/metrics endpoint is deprecated for modern deployments that use custom scanners, but it may still report some workerStats.
6. Automation and Response
- CalypsoAI does not currently offer an out-of-the-box automated remediation system for incidents; however, the telemetry infrastructure provided is designed to integrate with existing incident response tooling and workflows.
- Customers are encouraged to use alerting features in Grafana, Dynatrace, or other tools to automate actions as needed.