Mirador Core - Monitoring and Observability Guide
This document describes how to monitor and observe Mirador Core, covering metrics collection, distributed tracing, performance monitoring, and alerting rules.
Overview
Mirador Core exposes metrics, distributed traces, and alerting rules to give operators visibility into the health and reliability of the unified observability platform. The monitoring stack includes:
Prometheus for metrics collection and storage
OpenTelemetry for distributed tracing
Alertmanager for alerting and notifications
Metrics Collection
Unified Query Engine Metrics
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| | Counter | Total number of unified query operations | |
| | Histogram | Duration of unified query operations | |
| | Counter | Cache operations for unified queries | |
| | Counter | Correlation operations within unified queries | |
Correlation Engine Metrics
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| | Counter | Total correlation operations | |
| | Histogram | Duration of correlation operations | |
| | Histogram | Duration of individual engine queries | |
| | Histogram | Duration of parallel execution coordination | |
| | Histogram | Duration of result merging operations | |
| | Counter | Cache operations for correlations | |
| | Counter | Correlation-specific errors | |
| | Gauge | Current memory usage | |
| | Counter | CPU usage in seconds | |
Tracing Metrics
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| | Counter | Total traces started | |
| | Counter | Total traces completed | |
| | Gauge | Currently active traces | |
| | Counter | Traces that were sampled | |
| | Histogram | Duration of query traces | |
| | Histogram | Duration of correlation traces | |
| | Counter | Total spans created | |
| | Counter | Trace-related errors | |
| | Counter | Trace export operations | |
| | Counter | Service-to-service calls | |
| | Counter | Span attributes usage | |
General Metrics
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| | Counter | General application errors | |
MariaDB Integration Metrics
When MariaDB integration is enabled, the following metrics are available:
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| mirador_core_mariadb_connected | Gauge | MariaDB connection status (1=connected, 0=disconnected) | |
| | Counter | Total MariaDB queries | |
| | Histogram | Duration of MariaDB queries | |
| mirador_core_kpi_sync_operations_total | Counter | KPI sync operations | |
| | Histogram | Duration of KPI sync operations | |
| | Counter | Number of KPIs synced | |
MariaDB Health Check
The /api/v1/health endpoint includes MariaDB status:
curl http://localhost:8010/api/v1/health | jq '.components.mariadb'
Response:
{
  "enabled": true,
  "connected": true,
  "host": "mariadb.example.com",
  "database": "tenant_acme"
}
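A script that gates on this response can check the MariaDB component directly. The following is a minimal sketch, assuming the health body nests the object above under components.mariadb (as the jq filter implies); the helper name mariadb_healthy is hypothetical and not part of Mirador Core.

```python
import json


def mariadb_healthy(health_body):
    """Return True when the health payload reports MariaDB as enabled and connected.

    Assumes the /api/v1/health body has a top-level "components" object with a
    "mariadb" entry shaped like the documented response.
    """
    payload = json.loads(health_body)
    mariadb = payload.get("components", {}).get("mariadb", {})
    return bool(mariadb.get("enabled")) and bool(mariadb.get("connected"))


# Example payload mirroring the documented response (values are the doc's samples).
sample = json.dumps({
    "components": {
        "mariadb": {
            "enabled": True,
            "connected": True,
            "host": "mariadb.example.com",
            "database": "tenant_acme",
        }
    }
})
print(mariadb_healthy(sample))
```

Missing keys are treated as unhealthy, so a payload without a mariadb component does not pass the check by accident.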
MariaDB Alerts
Add these alerts for MariaDB monitoring:
# MariaDB connection alert
- alert: MariaDBConnectionLost
  expr: mirador_core_mariadb_connected == 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "MariaDB connection lost"
    description: "MariaDB connection has been down for more than 1 minute"
# KPI sync failure alert
- alert: KPISyncFailure
  expr: rate(mirador_core_kpi_sync_operations_total{status="failure"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "KPI sync failures detected"
    description: "KPI sync to Weaviate is experiencing failures"
Distributed Tracing
Trace Structure
Mirador Core implements hierarchical tracing with the following span structure:
unified-query-{query_type}
├── query-parsing
├── engine-routing
├── cache-lookup
├── {engine_type}-query
│ ├── query-execution
│ └── result-processing
├── correlation-{correlation_type} (if applicable)
│ ├── parallel-execution
│ │ ├── {engine_type}-query
│ │ └── {engine_type}-query
│ └── result-merging
└── response-formatting
Span Attributes
All spans include the following attributes:
service.name: "mirador-core"
service.version: "v10.0.1"
operation.name: Specific operation being traced
query.id: Unique query identifier
user.id: User identifier (if available)
query.type: Type of query (metrics, logs, traces)
engine.type: Engine used for execution
Sampling Configuration
Tracing uses adaptive sampling based on:
Query complexity (number of engines involved)
Query latency (high-latency queries are sampled more often)
Error rate (failed queries are always sampled)
System load (reduced sampling under high load)
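The sketch below illustrates how these four factors might combine into a single sampling probability. It is not Mirador Core's implementation: the function names, the thresholds (1 second latency, 0.8 load), and the multipliers are all assumptions chosen for illustration. Only the default base ratio of 0.1 mirrors the sampling_ratio value in the tracing configuration.

```python
import random


def sampling_probability(engines_involved, latency_seconds, failed,
                         system_load, base_ratio=0.1):
    """Combine the four adaptive-sampling factors into one probability.

    All thresholds and multipliers are illustrative assumptions.
    """
    if failed:
        return 1.0  # failed queries are always sampled
    probability = base_ratio
    # Complexity: scale with the number of engines involved.
    probability *= max(engines_involved, 1)
    # Latency: sample slow queries more often (hypothetical 1 s threshold).
    if latency_seconds > 1.0:
        probability *= 2.0
    # Load: halve sampling under high load (hypothetical 0.8 threshold).
    if system_load > 0.8:
        probability *= 0.5
    return min(probability, 1.0)


def should_sample(probability, rng=random):
    """Draw a uniform number in [0, 1) and sample when it falls below probability."""
    return rng.random() < probability


p = sampling_probability(engines_involved=3, latency_seconds=2.0,
                         failed=False, system_load=0.3)
print(p, should_sample(p))
```

Keeping the probability calculation separate from the random draw makes the policy easy to unit test without seeding a random number generator.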
Alerting Rules
Alert Categories
Performance Alerts
HighQueryLatency: Query latency exceeds 5 seconds (95th percentile)
QuerySuccessRateLow: Query success rate drops below 95%
CacheHitRateLow: Cache hit rate drops below 70%
HighCorrelationLatency: Correlation latency exceeds 10 seconds
CorrelationSuccessRateLow: Correlation success rate drops below 90%
EngineQueryTimeout: Individual engine queries exceed 30 seconds
Reliability Alerts
TraceExportFailure: Trace export failure rate exceeds 5%
HighTraceErrorRate: Trace error rate exceeds 10 errors/second
HighErrorRate: General error rate exceeds 5 errors/second per component
ServiceDown: Mirador Core service is unavailable
Resource Alerts
HighMemoryUsage: Memory usage exceeds 85% of limit
HighCPUUsage: CPU usage exceeds 80%
QueryThroughputDrop: Query throughput drops below 10 ops/sec
CorrelationThroughputDrop: Correlation throughput drops below 5 ops/sec
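HighQueryLatency is defined on the 95th percentile, which Prometheus estimates from histogram buckets rather than from raw samples. The sketch below approximates such an estimate with linear interpolation inside the target bucket, similar in spirit to PromQL's histogram_quantile function. It is an illustration, not a copy of the Prometheus implementation, and the bucket boundaries and counts are made-up sample data.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, ending with a (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts for a query duration histogram.
sample_buckets = [(1.0, 50), (5.0, 90), (10.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
print(p95, p95 > 5.0)  # second value shows whether the 5 s threshold is exceeded
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the alert threshold.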
Alert Configuration
Alerting rules are defined in deployments/grafana/alerting-rules.yml and loaded by Prometheus through the rule_files setting. Alertmanager then routes firing alerts to the configured receivers.
Example Alertmanager configuration:
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mirador-alerts'
  routes:
    - match:
        severity: critical
      receiver: 'mirador-critical'
receivers:
  - name: 'mirador-alerts'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#mirador-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  # Every receiver referenced by a route must be defined.
  # Placeholder: add notification integrations for critical alerts here.
  - name: 'mirador-critical'
Configuration
Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "deployments/grafana/alerting-rules.yml"
scrape_configs:
  - job_name: 'mirador-core'
    static_configs:
      - targets: ['mirador-core:9090']
    metrics_path: '/metrics'
OpenTelemetry Configuration
# tracing configuration in config.yaml
tracing:
  enabled: true
  service_name: "mirador-core"
  service_version: "v10.0.1"
  sampling_ratio: 0.1
  jaeger_endpoint: "http://jaeger:14268/api/traces"
  otlp_endpoint: "http://otel-collector:4318"
  # Resource attributes
  resource_attributes:
    service.name: "mirador-core"
    service.version: "v10.0.1"
    service.instance.id: "${HOSTNAME}"
    deployment.environment: "${ENVIRONMENT}"
Troubleshooting
Common Issues
Metrics Not Appearing in Prometheus
Check that the /metrics endpoint is accessible: curl http://mirador-core:9090/metrics
Verify that the Prometheus scrape configuration targets the correct port and path
Check Mirador Core logs for metrics collection errors
Traces Not Appearing in Jaeger
Verify OpenTelemetry configuration is correct
Check network connectivity to Jaeger endpoint
Ensure proper sampling configuration
Check Mirador Core logs for tracing errors
Alerts Not Firing
Check the Prometheus and Alertmanager configuration
Validate the alerting rules file: promtool check rules alerting-rules.yml
Confirm alert conditions are met by querying metrics directly
Check Alertmanager logs for delivery issues
Performance Tuning
High Cardinality Metrics
Monitor for high cardinality in label combinations, especially:
query_type × engine_routed combinations
correlation_type × engines_count combinations
operation × attribute_name in tracing
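One way to spot cardinality growth is to count distinct series per metric in a /metrics scrape. The sketch below does this for Prometheus text exposition output using a simplified parser that assumes samples carry no trailing timestamp. The metric name mirador_core_example_total and the sample values are hypothetical placeholders; the label names come from the list above.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count distinct label sets (series) per metric name in a /metrics scrape."""
    seen = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Drop the sample value; what remains is the metric name plus labels.
        seen.add(line.rsplit(" ", 1)[0])
    counts = Counter()
    for series in seen:
        counts[series.split("{", 1)[0]] += 1
    return counts


# Hypothetical scrape output for illustration only.
sample_scrape = """
# TYPE mirador_core_example_total counter
mirador_core_example_total{query_type="logs",engine_routed="a"} 3
mirador_core_example_total{query_type="logs",engine_routed="b"} 1
mirador_core_example_total{query_type="metrics",engine_routed="a"} 7
process_up 1
"""
counts = series_per_metric(sample_scrape)
for name, n in counts.most_common():
    print(name, n)
```

Running a count like this periodically and comparing results over time shows which label combinations are driving series growth.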
Sampling Optimization
Adjust sampling ratios based on:
Traffic volume
Storage capacity
Required observability granularity
Performance impact tolerance
Resource Usage
Monitor and tune:
Memory usage for correlation result caching
CPU usage during parallel execution
Network bandwidth for trace exports
Storage requirements for metrics retention
Log Analysis
Key log patterns to monitor:
ERROR.*correlation.*timeout - Engine timeouts
WARN.*cache.*miss.*rate - Cache performance issues
ERROR.*trace.*export - Tracing export failures
WARN.*memory.*usage.*high - Resource pressure
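These patterns can be applied to a log stream with a small classifier. The sketch below is a minimal illustration using the four expressions above as written; the function name classify and the sample log lines are hypothetical, since the exact Mirador Core log format is not shown here.

```python
import re

# Patterns and meanings taken from the list above.
LOG_PATTERNS = [
    (re.compile(r"ERROR.*correlation.*timeout"), "Engine timeouts"),
    (re.compile(r"WARN.*cache.*miss.*rate"), "Cache performance issues"),
    (re.compile(r"ERROR.*trace.*export"), "Tracing export failures"),
    (re.compile(r"WARN.*memory.*usage.*high"), "Resource pressure"),
]


def classify(line):
    """Return the category of the first matching pattern, or None if none match."""
    for pattern, category in LOG_PATTERNS:
        if pattern.search(line):
            return category
    return None


# Hypothetical log lines for illustration only.
for sample_line in [
    "2024-01-01T00:00:00Z ERROR correlation engine timeout after 30s",
    "2024-01-01T00:00:01Z WARN cache miss rate above threshold",
    "2024-01-01T00:00:02Z INFO query completed",
]:
    print(classify(sample_line), "<-", sample_line)
```

The patterns are case-sensitive and order-dependent, so a line is attributed to the first category it matches.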