Mirador Core - Monitoring and Observability Guide

This document provides guidance for monitoring and observing Mirador Core, including metrics collection, distributed tracing, performance monitoring, and alerting rules.

Table of Contents

  1. Overview

  2. Metrics Collection

  3. Distributed Tracing

  4. Alerting Rules

  5. Configuration

  6. Troubleshooting

Overview

Mirador Core exposes metrics, traces, and alerts so that operators can see how the unified observability platform is behaving and catch reliability problems early. The monitoring stack includes:

  • Prometheus for metrics collection and storage

  • OpenTelemetry for distributed tracing

  • AlertManager for alerting and notifications

Metrics Collection

Unified Query Engine Metrics

| Metric Name | Type | Description | Labels |
|---|---|---|---|
| mirador_core_unified_query_operations_total | Counter | Total number of unified query operations | query_type, status, engine_routed |
| mirador_core_unified_query_operation_duration_seconds | Histogram | Duration of unified query operations | query_type, status |
| mirador_core_unified_query_cache_operations_total | Counter | Cache operations for unified queries | result (hit/miss) |
| mirador_core_unified_query_correlation_operations_total | Counter | Correlation operations within unified queries | correlation_type, status |
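The metrics above are most useful once they are turned into ratios and percentiles. The following is a minimal sketch of Prometheus recording rules that derive a 95th percentile latency, a success ratio, and a cache hit ratio from them. The group name, the recorded series names, the 5-minute window, and the status="success" label value are assumptions for illustration; the metric and label names come from the table.

```yaml
# Hypothetical recording rules; not part of the shipped rule file.
groups:
  - name: mirador-core-unified-query-recordings   # assumed group name
    rules:
      # 95th percentile unified query latency, per query_type.
      - record: mirador_core:unified_query_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, query_type) (
              rate(mirador_core_unified_query_operation_duration_seconds_bucket[5m])
            )
          )

      # Fraction of unified queries that succeeded.
      # Assumes the status label uses the value "success".
      - record: mirador_core:unified_query_operations:success_ratio
        expr: |
          sum(rate(mirador_core_unified_query_operations_total{status="success"}[5m]))
            /
          sum(rate(mirador_core_unified_query_operations_total[5m]))

      # Fraction of cache lookups that were hits.
      - record: mirador_core:unified_query_cache:hit_ratio
        expr: |
          sum(rate(mirador_core_unified_query_cache_operations_total{result="hit"}[5m]))
            /
          sum(rate(mirador_core_unified_query_cache_operations_total[5m]))
```

Recording these once keeps dashboards and alert expressions short, and the histogram_quantile call is then evaluated once per interval rather than on every dashboard refresh.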

Correlation Engine Metrics

| Metric Name | Type | Description | Labels |
|---|---|---|---|
| mirador_core_correlation_operations_total | Counter | Total correlation operations | correlation_type, status |
| mirador_core_correlation_duration_seconds | Histogram | Duration of correlation operations | correlation_type |
| mirador_core_correlation_engine_query_duration_seconds | Histogram | Duration of individual engine queries | engine_type |
| mirador_core_correlation_parallel_execution_duration_seconds | Histogram | Duration of parallel execution coordination | engines_count |
| mirador_core_correlation_result_merging_duration_seconds | Histogram | Duration of result merging operations | correlations_count |
| mirador_core_correlation_cache_operations_total | Counter | Cache operations for correlations | result (hit/miss) |
| mirador_core_correlation_errors_total | Counter | Correlation-specific errors | error_type |
| mirador_core_correlation_memory_usage_bytes | Gauge | Current memory usage | - |
| mirador_core_correlation_cpu_usage_seconds_total | Counter | CPU usage in seconds | - |

Tracing Metrics

| Metric Name | Type | Description | Labels |
|---|---|---|---|
| mirador_core_traces_started_total | Counter | Total traces started | - |
| mirador_core_traces_completed_total | Counter | Total traces completed | - |
| mirador_core_traces_active_total | Gauge | Currently active traces | - |
| mirador_core_traces_sampled_total | Counter | Traces that were sampled | - |
| mirador_core_query_trace_duration_seconds | Histogram | Duration of query traces | query_type |
| mirador_core_correlation_trace_duration_seconds | Histogram | Duration of correlation traces | correlation_type |
| mirador_core_spans_created_total | Counter | Total spans created | operation |
| mirador_core_trace_errors_total | Counter | Trace-related errors | error_type |
| mirador_core_trace_exports_total | Counter | Trace export operations | status (success/failure) |
| mirador_core_service_calls_total | Counter | Service-to-service calls | source_service, target_service |
| mirador_core_span_attributes_total | Counter | Span attributes usage | attribute_name |

General Metrics

| Metric Name | Type | Description | Labels |
|---|---|---|---|
| mirador_core_errors_total | Counter | General application errors | component |

MariaDB Integration Metrics

When MariaDB integration is enabled, the following metrics are available:

| Metric Name | Type | Description | Labels |
|---|---|---|---|
| mirador_core_mariadb_connected | Gauge | MariaDB connection status (1=connected, 0=disconnected) | database |
| mirador_core_mariadb_queries_total | Counter | Total MariaDB queries | table, operation, status |
| mirador_core_mariadb_query_duration_seconds | Histogram | Duration of MariaDB queries | table, operation |
| mirador_core_kpi_sync_operations_total | Counter | KPI sync operations | status (success/failure) |
| mirador_core_kpi_sync_duration_seconds | Histogram | Duration of KPI sync operations | - |
| mirador_core_kpi_sync_items_total | Counter | Number of KPIs synced | operation (create/update) |

MariaDB Health Check

The /api/v1/health endpoint includes MariaDB status:

curl http://localhost:8010/api/v1/health | jq '.components.mariadb'

Response:

{
  "enabled": true,
  "connected": true,
  "host": "mariadb.example.com",
  "database": "tenant_acme"
}

MariaDB Alerts

Add these alerts for MariaDB monitoring:

# MariaDB connection alert
- alert: MariaDBConnectionLost
  expr: mirador_core_mariadb_connected == 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "MariaDB connection lost"
    description: "MariaDB connection has been down for more than 1 minute"

# KPI sync failure alert
- alert: KPISyncFailure
  expr: rate(mirador_core_kpi_sync_operations_total{status="failure"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "KPI sync failures detected"
    description: "KPI sync to Weaviate is experiencing failures"

Distributed Tracing

Trace Structure

Mirador Core implements hierarchical tracing with the following span structure:

unified-query-{query_type}
├── query-parsing
├── engine-routing
├── cache-lookup
├── {engine_type}-query
│   ├── query-execution
│   └── result-processing
├── correlation-{correlation_type} (if applicable)
│   ├── parallel-execution
│   │   ├── {engine_type}-query
│   │   └── {engine_type}-query
│   └── result-merging
└── response-formatting

Span Attributes

All spans include the following attributes:

  • service.name: “mirador-core”

  • service.version: “v10.0.1”

  • operation.name: Specific operation being traced

  • query.id: Unique query identifier

  • user.id: User identifier (if available)

  • query.type: Type of query (metrics, logs, traces)

  • engine.type: Engine used for execution

Sampling Configuration

Tracing uses adaptive sampling based on:

  • Query complexity (number of engines involved)

  • Query latency (high-latency queries are sampled at a higher rate)

  • Error rate (failed queries are always sampled)

  • System load (reduced sampling under high load)
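Mirador Core's adaptive sampler is internal, and its tuning options are not described here. For comparison, a similar policy (always keep failed traces, keep slow traces, sample the remainder) can be approximated downstream with the tail_sampling processor from the OpenTelemetry Collector contrib distribution. This is a sketch of a different technique, not Mirador Core's own mechanism. The policy names are hypothetical, the 5000 ms threshold mirrors the 5-second latency alert threshold, and the 10 percent baseline mirrors a sampling_ratio of 0.1.

```yaml
# OpenTelemetry Collector (contrib) fragment: tail-based sampling sketch.
# The processor must also be listed in a traces pipeline to take effect.
processors:
  tail_sampling:
    # How long to buffer spans before deciding on a trace.
    decision_wait: 10s
    policies:
      # Keep every trace that contains an error status.
      - name: keep-errors            # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep traces slower than 5 seconds end to end.
      - name: keep-slow-queries      # hypothetical policy name
        type: latency
        latency:
          threshold_ms: 5000
      # Keep roughly 10 percent of everything else.
      - name: baseline               # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept if any policy matches. Tail sampling can only act on traces the service actually exports, so head-based sampling in the application limits what the collector ever sees.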

Alerting Rules

Alert Categories

Performance Alerts

  • HighQueryLatency: Query latency exceeds 5 seconds (95th percentile)

  • QuerySuccessRateLow: Query success rate drops below 95%

  • CacheHitRateLow: Cache hit rate drops below 70%

  • HighCorrelationLatency: Correlation latency exceeds 10 seconds

  • CorrelationSuccessRateLow: Correlation success rate drops below 90%

  • EngineQueryTimeout: Individual engine queries exceed 30 seconds
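As an illustration, the performance alerts listed above could be expressed as Prometheus rules along the following lines. This is a sketch under stated assumptions, and the definitions in deployments/grafana/alerting-rules.yml remain authoritative. The thresholds come from the list. The group name, rate windows, `for` durations, severities, the choice of percentile for correlation and engine latency, and the status="success" label value are all assumptions.

```yaml
groups:
  - name: mirador-core-performance   # assumed group name
    rules:
      - alert: HighQueryLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (
            rate(mirador_core_unified_query_operation_duration_seconds_bucket[5m])
          )) > 5
        for: 5m
        labels:
          severity: warning
      - alert: QuerySuccessRateLow
        expr: |
          sum(rate(mirador_core_unified_query_operations_total{status="success"}[5m]))
            /
          sum(rate(mirador_core_unified_query_operations_total[5m])) < 0.95
        for: 5m
        labels:
          severity: warning
      - alert: CacheHitRateLow
        expr: |
          sum(rate(mirador_core_unified_query_cache_operations_total{result="hit"}[5m]))
            /
          sum(rate(mirador_core_unified_query_cache_operations_total[5m])) < 0.70
        for: 10m
        labels:
          severity: warning
      - alert: HighCorrelationLatency
        # Assumes the 10-second threshold applies to the 95th percentile.
        expr: |
          histogram_quantile(0.95, sum by (le) (
            rate(mirador_core_correlation_duration_seconds_bucket[5m])
          )) > 10
        for: 5m
        labels:
          severity: warning
      - alert: CorrelationSuccessRateLow
        expr: |
          sum(rate(mirador_core_correlation_operations_total{status="success"}[5m]))
            /
          sum(rate(mirador_core_correlation_operations_total[5m])) < 0.90
        for: 5m
        labels:
          severity: warning
      - alert: EngineQueryTimeout
        # Per-engine 99th percentile; only meaningful if the histogram
        # has buckets at or above 30 seconds.
        expr: |
          histogram_quantile(0.99, sum by (le, engine_type) (
            rate(mirador_core_correlation_engine_query_duration_seconds_bucket[5m])
          )) > 30
        for: 2m
        labels:
          severity: critical
```

The ratio expressions return no result when there is no traffic, so these alerts stay silent during idle periods rather than firing on a division by zero.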

Reliability Alerts

  • TraceExportFailure: Trace export failure rate exceeds 5%

  • HighTraceErrorRate: Trace error rate exceeds 10 errors/second

  • HighErrorRate: General error rate exceeds 5 errors/second per component

  • ServiceDown: Mirador Core service is unavailable
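The reliability alerts above might be written as follows. This is a hedged sketch rather than the contents of the shipped rule file. The thresholds are taken from the list, and the job label matches the scrape job name mirador-core. The group name, rate windows, `for` durations, and severities are assumptions.

```yaml
groups:
  - name: mirador-core-reliability   # assumed group name
    rules:
      - alert: TraceExportFailure
        # Share of export attempts that failed over the last 5 minutes.
        expr: |
          sum(rate(mirador_core_trace_exports_total{status="failure"}[5m]))
            /
          sum(rate(mirador_core_trace_exports_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
      - alert: HighTraceErrorRate
        expr: sum(rate(mirador_core_trace_errors_total[5m])) > 10
        for: 5m
        labels:
          severity: warning
      - alert: HighErrorRate
        # Evaluated per component, so each component alerts independently.
        expr: sum by (component) (rate(mirador_core_errors_total[5m])) > 5
        for: 5m
        labels:
          severity: warning
      - alert: ServiceDown
        # `up` is set by Prometheus itself for every scrape target.
        expr: up{job="mirador-core"} == 0
        for: 1m
        labels:
          severity: critical
```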

Resource Alerts

  • HighMemoryUsage: Memory usage exceeds 85% of limit

  • HighCPUUsage: CPU usage exceeds 80%

  • QueryThroughputDrop: Query throughput drops below 10 ops/sec

  • CorrelationThroughputDrop: Correlation throughput drops below 5 ops/sec
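A possible shape for the resource alerts is sketched below, with several assumptions made explicit. The metrics table does not list a memory limit metric, so the memory rule compares against a literal 2 GiB placeholder that must be replaced with the real limit. The CPU rule treats 80% as 0.8 CPU cores, and it uses the correlation engine CPU counter because that is the only CPU metric listed. Group name, windows, `for` durations, and severities are also assumptions.

```yaml
groups:
  - name: mirador-core-resources   # assumed group name
    rules:
      - alert: HighMemoryUsage
        # 2147483648 bytes (2 GiB) is a placeholder limit; replace it.
        expr: mirador_core_correlation_memory_usage_bytes > 0.85 * 2147483648
        for: 5m
        labels:
          severity: warning
      - alert: HighCPUUsage
        # rate() of a CPU-seconds counter yields cores in use.
        expr: rate(mirador_core_correlation_cpu_usage_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
      - alert: QueryThroughputDrop
        expr: sum(rate(mirador_core_unified_query_operations_total[5m])) < 10
        for: 10m
        labels:
          severity: warning
      - alert: CorrelationThroughputDrop
        expr: sum(rate(mirador_core_correlation_operations_total[5m])) < 5
        for: 10m
        labels:
          severity: warning
```

The throughput rules return nothing when the series are absent altogether, so they do not replace an availability check on the scrape target.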

Alert Configuration

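Alerting rules are defined in deployments/grafana/alerting-rules.yml. Prometheus loads and evaluates these rules (through its rule_files setting) and forwards firing alerts to AlertManager, which handles grouping, routing, and notification.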

Example AlertManager configuration:

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mirador-alerts'
  routes:
  - match:
      severity: critical
    receiver: 'mirador-critical'

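receivers:
- name: 'mirador-alerts'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#mirador-alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.description }}'
# Every receiver referenced by a route must be defined, or AlertManager
# rejects the configuration. The channel name below is a placeholder.
- name: 'mirador-critical'
  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#mirador-critical'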

Configuration

Prometheus Configuration

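global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Forward firing alerts to AlertManager. The host name is an assumption;
# 9093 is AlertManager's default port. Adjust for your deployment.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "deployments/grafana/alerting-rules.yml"

scrape_configs:
  - job_name: 'mirador-core'
    static_configs:
      - targets: ['mirador-core:9090']
    metrics_path: '/metrics'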

OpenTelemetry Configuration

# tracing configuration in config.yaml
tracing:
  enabled: true
  service_name: "mirador-core"
  service_version: "v10.0.1"
  sampling_ratio: 0.1
  jaeger_endpoint: "http://jaeger:14268/api/traces"
  otlp_endpoint: "http://otel-collector:4318"

  # Resource attributes
  resource_attributes:
    service.name: "mirador-core"
    service.version: "v10.0.1"
    service.instance.id: "${HOSTNAME}"
    deployment.environment: "${ENVIRONMENT}"

Troubleshooting

Common Issues

Metrics Not Appearing in Prometheus

  1. Check if the /metrics endpoint is accessible:

    curl http://mirador-core:9090/metrics
    
  2. Verify Prometheus scrape configuration targets the correct port and path

  3. Check Mirador Core logs for metrics collection errors

Traces Not Appearing in Jaeger

  1. Verify OpenTelemetry configuration is correct

  2. Check network connectivity to Jaeger endpoint

  3. Ensure proper sampling configuration

  4. Check Mirador Core logs for tracing errors

Alerts Not Firing

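  1. Check that Prometheus is configured to send alerts to AlertManager, and that the AlertManager routing configuration is valid

  2. Validate the rule file syntax with promtool check rules deployments/grafana/alerting-rules.yml, then confirm the rules are loaded on the Prometheus Rules page (/rules)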

  3. Confirm alert conditions are met by querying metrics directly

  4. Check AlertManager logs for delivery issues

Performance Tuning

High Cardinality Metrics

Monitor for high cardinality in label combinations, especially:

  • query_type × engine_routed combinations

  • correlation_type × engines_count combinations

  • operation × attribute_name in tracing
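One way to keep an eye on cardinality is to track how many active series each Mirador Core metric produces. The rules below are a sketch. The group name, recorded series name, alert name, and the 10000-series threshold are hypothetical and should be sized to your Prometheus instance. Because a recording rule overwrites the metric name of its result, the original name is first copied into a separate `metric` label.

```yaml
groups:
  - name: mirador-core-cardinality   # assumed group name
    rules:
      # Active series per Mirador Core metric, with the original metric
      # name preserved in the `metric` label.
      - record: mirador_core:series:count
        expr: |
          count by (metric) (
            label_replace({__name__=~"mirador_core_.+"}, "metric", "$1", "__name__", "(.+)")
          )

      # Fires when any single metric exceeds the (placeholder) series budget.
      - alert: MiradorHighSeriesCount   # hypothetical alert name
        expr: |
          count by (metric) (
            label_replace({__name__=~"mirador_core_.+"}, "metric", "$1", "__name__", "(.+)")
          ) > 10000
        for: 15m
        labels:
          severity: warning
```

Histograms count once per bucket, so a histogram metric with many label values will dominate this view well before a counter with the same labels does.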

Sampling Optimization

Adjust sampling ratios based on:

  • Traffic volume

  • Storage capacity

  • Required observability granularity

  • Performance impact tolerance

Resource Usage

Monitor and tune:

  • Memory usage for correlation result caching

  • CPU usage during parallel execution

  • Network bandwidth for trace exports

  • Storage requirements for metrics retention

Log Analysis

Key log patterns to monitor:

  • ERROR.*correlation.*timeout - Engine timeouts

  • WARN.*cache.*miss.*rate - Cache performance issues

  • ERROR.*trace.*export - Tracing export failures

  • WARN.*memory.*usage.*high - Resource pressure

Runbooks