CalypsoAI Monitoring & Metrics

For on-premises deployments, customers are responsible for monitoring the deployed solution themselves. To support this, CalypsoAI exposes a wide set of internal metrics that you can use for observability and troubleshooting:

1. Prometheus-Compatible Metrics for cai-scanner

  • Metrics are available at the /metrics endpoint.
  • Common tools like Prometheus or Dynatrace can scrape this data.
  • Example metrics include:
    • Number of running/waiting requests (vllm:num_requests_running, vllm:num_requests_waiting)
    • GPU cache metrics and memory usage
    • Python garbage collection and memory stats
    • CPU and process metrics (process_cpu_seconds_total, process_resident_memory_bytes)
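
To show what a scraper sees at this endpoint, here is a minimal Python sketch that parses Prometheus text-format output and reads the queue metrics named above. The sample payload, its values, and the model_name label are illustrative assumptions, not captured cai-scanner output, and the parser covers only simple sample lines (no exemplars or timestamps).

```python
import re
import urllib.request

# One sample line: metric_name{optional="labels"} value
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical payload for illustration only; real output will differ.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 7.0
process_cpu_seconds_total 12.5
process_resident_memory_bytes 2.5e+08
"""


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:  # ignore anything this simple parser cannot read
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def total(samples, name):
    """Sum a metric across all of its label combinations."""
    return sum(value for n, _, value in samples if n == name)


def fetch_metrics(url, timeout=5.0):
    """Fetch the raw text from a /metrics endpoint (the URL is deployment-specific)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


samples = parse_metrics(SAMPLE)
print("running:", total(samples, "vllm:num_requests_running"))
print("waiting:", total(samples, "vllm:num_requests_waiting"))
```

Against a live deployment, you would pass the text returned by fetch_metrics to parse_metrics; the host and port of the cai-scanner service depend on your environment. In practice Prometheus or Dynatrace does this scraping for you, so the sketch is mainly useful for ad hoc checks.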

2. GPU Monitoring

  • CalypsoAI recommends deploying the NVIDIA DCGM Exporter as a DaemonSet for GPU telemetry.
  • This allows collection of detailed GPU performance data.

3. Moderator Component Telemetry

  • Similar Prometheus-compatible metrics are available via the cai-moderator service.
  • These include:
    • Thread and DB connection availability
    • Worker processing times
    • General Python and process metrics

4. Dashboarding and Visualization

  • CalypsoAI can provide a pre-built Grafana dashboard (e.g., vllm-dashboard.json) to visualize these metrics effectively.
  • Prometheus and Grafana can be set up using Helm charts, and CalypsoAI can guide customers through this process as needed.
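
Once Prometheus is scraping the endpoints, a dashboard panel is essentially a PromQL query against it. The Python sketch below runs the same kind of query through the Prometheus HTTP API (/api/v1/query), which can help verify that data is arriving before a dashboard is imported. The base URL, the PromQL expression, and the sample response are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_instant_vector(payload):
    """Return a list of (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed: " + str(payload.get("error", "unknown error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each series carries its labels and a [timestamp, "value-as-string"] pair.
    return [(series["metric"], float(series["value"][1])) for series in data["result"]]


def run_query(base_url, promql, timeout=5.0):
    """Execute the query against a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Hypothetical response body, shaped like a Prometheus instant-vector result.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model_name": "demo"}, "value": [1700000000.0, "7"]},
        ],
    },
}

for labels, value in parse_instant_vector(SAMPLE_RESPONSE):
    print(labels, value)
```

With a live server, a call such as run_query("http://prometheus.example.internal:9090", "sum(vllm:num_requests_waiting)") would return the current queue depth; the hostname here is a placeholder.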

5. Legacy Metrics (For Reference)

  • The legacy /backend/v1/app/metrics endpoint is deprecated for modern deployments that use custom scanners, but it may still report some workerStats.

6. Automation and Response

  • CalypsoAI does not currently offer out-of-the-box automated incident remediation. However, the telemetry it provides is designed to integrate with existing incident response tooling and workflows.
  • Customers are encouraged to use alerting features in Grafana, Dynatrace, or other tools to automate actions as needed.
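
Alerting tools typically fire only after a condition has held for several consecutive evaluations, which avoids reacting to brief spikes. The Python sketch below illustrates that logic for a queue-depth metric. It is not a CalypsoAI feature: the ThresholdRule class, the threshold of 10, and the three-sample window are all hypothetical, and in practice Grafana or Dynatrace alert rules would implement this for you.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdRule:
    """Fire when a metric stays above a threshold for N consecutive samples."""
    metric: str
    threshold: float
    consecutive: int
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, value):
        """Record one sample; return True only on the sample that trips the rule."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any healthy sample resets the streak
        return self._streak == self.consecutive


# Hypothetical rule: more than 10 waiting requests for 3 samples in a row.
rule = ThresholdRule("vllm:num_requests_waiting", threshold=10, consecutive=3)

for depth in [5, 12, 15, 11, 20, 3, 12]:
    if rule.observe(depth):
        # Placeholder for a real action, e.g. paging on-call or scaling out.
        print("ALERT:", rule.metric, "sustained above", rule.threshold)
```

Returning True only on the tripping sample means a long-running breach triggers the action once rather than on every evaluation.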