Monitoring docker-compose

To monitor metrics in self-hosted Databand, you can use local components or external monitoring systems, such as New Relic and Datadog.

Local components

The Databand stack includes two optional components:

  • cadvisor, which provides metrics for the Docker runtime and containers.
  • node_exporter, which provides metrics for the VM where the containers are running.

Databand supports Prometheus and OpenMetrics monitoring by default. You can use bundled or external Prometheus, or other monitoring solutions, to scrape and store these metrics. The bundled Prometheus configuration also has pre-defined targets for some Databand components, such as webserver, tracking-server, and rule-engine. Metrics from these components are available in bundled Prometheus by default after you deploy Databand with docker-compose.

To enable cadvisor and node_exporter:
  • Open custom.env and add mods/monitoring.yml to the COMPOSE_FILE variable definition:
    COMPOSE_FILE=docker-compose.yml:mods/monitoring.yml
  • If you have other components that are enabled with the COMPOSE_FILE variable, for example a bundled local PostgreSQL database, use : to separate them from the mods/monitoring.yml string:
    COMPOSE_FILE=docker-compose.yml:mods/local_pg.yml:mods/monitoring.yml
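
The : separator rule can be illustrated with a short Python sketch. It is not part of Databand: the helper name add_compose_mod is hypothetical, and the function only builds the string value that you would then paste into custom.env.

```python
def add_compose_mod(compose_file, mod):
    """Append a mod file to a ':'-separated COMPOSE_FILE value.

    Hypothetical helper for illustration; it skips the mod if it is
    already listed, so running it twice does not duplicate entries.
    """
    # Split on ':' and drop empty entries left by stray separators.
    parts = [p for p in compose_file.split(":") if p]
    if mod not in parts:
        parts.append(mod)
    return ":".join(parts)


if __name__ == "__main__":
    value = add_compose_mod(
        "docker-compose.yml:mods/local_pg.yml", "mods/monitoring.yml"
    )
    print(f"COMPOSE_FILE={value}")
```

Appending rather than replacing keeps any mods that are already enabled, which is the case that the second bullet describes.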
    

You can list the available cadvisor and node_exporter metrics by accessing the /metrics endpoint of the corresponding container.

To list Databand's webserver, tracking-server, and rule-engine metrics, access the following endpoints from appropriate containers:
  • /api/internal/v1/dbnd_tracking_metrics and /api/internal/v1/dbnd_application_metrics for webserver
  • /api/internal/v1/dbnd_tracking_metrics for tracking-server
  • / for rule-engine

Most Databand metrics have a dbnd_ prefix in the metric name. Python runtime metrics have a flask_ prefix in the metric name.

You can discover all of these metrics in the bundled Prometheus UI.
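
If you prefer to inspect an endpoint's output outside the Prometheus UI, the following Python sketch groups metric names by prefix. It assumes that the endpoint returns the standard Prometheus text format; the function names and the sample values are made up for illustration, and fetching the text from a container (host and port depend on your deployment) is left out.

```python
from collections import defaultdict


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def group_by_prefix(names, prefixes=("dbnd_", "flask_")):
    """Group metric names under the first matching prefix, else 'other'."""
    groups = defaultdict(list)
    for name in sorted(names):
        key = next((p for p in prefixes if name.startswith(p)), "other")
        groups[key].append(name)
    return dict(groups)


# Made-up sample; real output contains many more metrics.
SAMPLE = """\
# TYPE flask_http_request_duration_seconds histogram
flask_http_request_duration_seconds_sum{path="/api/v1/tracking/init_run",status="200"} 4.2
flask_http_request_duration_seconds_count{path="/api/v1/tracking/init_run",status="200"} 12
dbnd_auth_tokens{label="example"} 1767225600
process_cpu_seconds_total 3.5
"""

if __name__ == "__main__":
    for prefix, names in group_by_prefix(metric_names(SAMPLE)).items():
        print(prefix, names)
```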

The following examples show Prometheus alerting rules that are built on these metrics.

Container memory usage:
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) by (instance, name) / sum(container_spec_memory_limit_bytes > 0) by (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: high
    annotations:
      summary: Container Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Container CPU throttle:
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: high
    annotations:
      summary: Container high throttle rate (instance {{ $labels.instance }})
      description: "Container is being throttled\n VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Databand endpoint response time:
    - alert: ApiResponseTimeTooHightUpdateTaskRunAttempts
      expr: (rate(flask_http_request_duration_seconds_sum{status="200",path="/api/v1/tracking/update_task_run_attempts"}[60s])/rate(flask_http_request_duration_seconds_count{status="200",path="/api/v1/tracking/update_task_run_attempts"}[60s])) > 10 < +Inf
      for: 10s
      labels:
        severity: high
      annotations:
        summary: API /api/v1/tracking/update_task_run_attempts average response time is too high
        description: "Average response time for API /api/v1/tracking/update_task_run_attempts is above 10s for the last 10s\n  VALUE = {{ $value }}s\n API = {{ $labels.path }}"

    - alert: ApiResponseTimeTooHightInitRun
      expr: (rate(flask_http_request_duration_seconds_sum{status="200",path="/api/v1/tracking/init_run"}[60s])/rate(flask_http_request_duration_seconds_count{status="200",path="/api/v1/tracking/init_run"}[60s])) > 10 < +Inf
      for: 10s
      labels:
        severity: high
      annotations:
        summary: API /api/v1/tracking/init_run average response time is too high
        description: "Average response time for API /api/v1/tracking/init_run is above 10s for the last 10s\n VALUE = {{ $value }}s\n API = {{ $labels.path }}"
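
Both response-time rules divide the rate of the _sum counter by the rate of the _count counter. Because both rates cover the same window, the window length cancels out and the result is approximately the average number of seconds per request in that window. The Python sketch below reproduces that arithmetic on two counter samples; the function name and the sample numbers are made up for illustration.

```python
def average_response_time(sum_start, sum_end, count_start, count_end):
    """Average seconds per request between two samples of the
    _sum and _count counters.

    Returns None when no requests completed in the window, which is
    the 0/0 case that the alert expression does not fire on.
    """
    requests = count_end - count_start
    if requests <= 0:
        return None
    return (sum_end - sum_start) / requests


if __name__ == "__main__":
    # Made-up samples taken 60 seconds apart.
    avg = average_response_time(100.0, 460.0, 50, 80)
    print(avg, "fires" if avg is not None and avg > 10 else "ok")
```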
Databand access token expiration:
    - alert: DatabandAccessTokenIsAboutToExpire
      expr: ((dbnd_auth_tokens - time()) / (3600 * 24)) <= 7 > 0 # 7 days
      for: 1m
      labels:
        severity: high
      annotations:
        summary: "Databand Access Token {{ $labels.label }} will expire in {{ humanize $value }} days"
        description: "Databand Access Token is about to expire"
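
The expression in this rule subtracts time() from dbnd_auth_tokens, which suggests that the metric value is the token's expiration time as a Unix timestamp. Under that assumption, the Python sketch below shows the same calculation; the function names are hypothetical.

```python
import time

SECONDS_PER_DAY = 3600 * 24


def days_until_expiry(expires_at, now=None):
    """Days left until the expiration timestamp (negative if expired)."""
    now = time.time() if now is None else now
    return (expires_at - now) / SECONDS_PER_DAY


def should_alert(expires_at, now=None, threshold_days=7):
    """Mirror the rule: fire only for tokens that are still valid
    but expire within the threshold."""
    days = days_until_expiry(expires_at, now)
    return 0 < days <= threshold_days


if __name__ == "__main__":
    now = time.time()
    print(should_alert(now + 3 * SECONDS_PER_DAY, now))   # expires in 3 days
    print(should_alert(now + 30 * SECONDS_PER_DAY, now))  # expires in 30 days
```

The lower bound (> 0 in the rule) keeps tokens that have already expired from triggering the alert indefinitely.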

Enabling New Relic monitoring

Follow these instructions to enable New Relic monitoring for a docker-compose deployment of Databand.

  1. Before you begin, you must download your JSON configuration file from New Relic. For more information, see the New Relic documentation.
  2. Copy the configuration file to databand/config/webserver/newrelic.ini in the deployment folder. This path is mapped under /etc/config, so the New Relic agent uses the file as its configuration.
  3. Enable New Relic in custom.env by setting NEW_RELIC_ENABLED=true:
    ## custom.env 
    
    NEW_RELIC_ENABLED=true
    
  4. Use make up to start Databand with the New Relic agent enabled.

Enabling Datadog

By default, Datadog logging is disabled when you run make up.

To enable Datadog logging:

  1. In custom.env, add the datadog.yml mod to your COMPOSE_FILE variable, set DATADOG_ENABLED=true, and override the following variables to match your setup:
    ## custom.env
    
    ## add datadog.yml mod to your COMPOSE_FILE variable
    COMPOSE_FILE=docker-compose.yml:./mods/datadog.yml
    DATADOG_ENABLED=true
    
    DATADOG_SITE=VALUE
    DATADOG_API_KEY=VALUE
    DATADOG_ENV=VALUE
  2. Use make up to launch Databand with the Datadog agent enabled.
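
Before you run make up, you might want to confirm that custom.env contains all of the Datadog settings from step 1. The Python sketch below is one way to do that. It is not a Databand tool: the function names are hypothetical, and it assumes that custom.env uses plain KEY=VALUE lines with # comments.

```python
REQUIRED = ("DATADOG_ENABLED", "DATADOG_SITE", "DATADOG_API_KEY", "DATADOG_ENV")


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def datadog_problems(env_text):
    """Return a list of human-readable problems; empty means it looks fine."""
    env = parse_env(env_text)
    problems = [f"{key} is not set" for key in REQUIRED if not env.get(key)]
    enabled = env.get("DATADOG_ENABLED")
    if enabled and enabled.lower() != "true":
        problems.append("DATADOG_ENABLED is not true")
    if "datadog.yml" not in env.get("COMPOSE_FILE", ""):
        problems.append("COMPOSE_FILE does not include the datadog.yml mod")
    return problems


if __name__ == "__main__":
    with open("custom.env", encoding="utf-8") as handle:
        for problem in datadog_problems(handle.read()):
            print(problem)
```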