# KPI, Failures, Correlation & RCA User Guide **Version:** 10.0.0 **Last Updated:** January 2026 **Target Audience:** API Consumers & Integration Engineers --- ## Table of Contents 1. [Overview](#overview) 2. [Prerequisites & Setup](#prerequisites--setup) 3. [Configuration & Deployment](#configuration--deployment) 4. [Component 1: KPI Management](#component-1-kpi-management) 5. [Component 2: Failure Detection](#component-2-failure-detection) 6. [Component 3: Correlation Analysis](#component-3-correlation-analysis) 7. [Component 4: Root Cause Analysis (RCA)](#component-4-root-cause-analysis-rca) 8. [Complete Workflows](#complete-workflows) 9. [API Reference](#api-reference) 10. [Troubleshooting](#troubleshooting) --- ## Overview ### What This Guide Covers This guide explains how to use Mirador Core's four interconnected observability components: ``` ┌──────────┐ ┌──────────┐ ┌─────────────┐ ┌──────┐ │ KPIs │ --> │ Failures │ --> │ Correlation │ --> │ RCA │ └──────────┘ └──────────┘ └─────────────┘ └──────┘ Define Detect Analyze Explain Metrics Incidents Patterns Root Cause ``` ### Dependency Chain Each component builds upon the previous: 1. **KPIs (Key Performance Indicators)**: Define what metrics to monitor 2. **Failures**: Detect incidents based on KPI anomalies and error signals 3. **Correlation**: Perform statistical analysis to find relationships between KPIs 4. **RCA (Root Cause Analysis)**: Use correlation data + 5 WHY methodology to identify root causes **Important**: Without KPIs defined, Failure Detection will have limited effectiveness. Without Failures and KPIs, RCA cannot function. ### Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ Mirador Core API │ │ (localhost:8010) │ └─────────────────────────────────────────────────────────────┘ │ ┌───────────────────┼───────────────────┐ │ │ │ ┌───▼────┐ ┌──────▼───────┐ ┌────▼─────┐ │ KPI │ │ Correlation │ │ RCA │ │ Repo │ │ Engine │ │ Engine │ └───┬────┘ └──────┬───────┘ └────┬─────┘ │ │ │ │ ┌───────▼────────┐ │ │ │ Failure │ │ └──────────► Detection ◄──────────┘ └────────┬───────┘ │ ┌───────────────────┼───────────────────┐ │ │ │ ┌────▼────┐ ┌──────▼───────┐ ┌─────▼─────┐ │Victoria │ │ Victoria │ │ Victoria │ │ Metrics │ │ Logs │ │ Traces │ └─────────┘ └──────────────┘ └───────────┘ ``` --- ## Prerequisites & Setup ### System Requirements **Mirador Core:** - Go 1.21+ (for building from source) - Docker & Docker Compose (for containerized deployment) - 4GB RAM minimum (8GB recommended) - 20GB disk space **Required Backend Services:** - VictoriaMetrics (metrics storage) - VictoriaLogs (logs storage) - VictoriaTraces (traces storage) - Valkey/Redis (caching) - Weaviate (optional, for KPI vector storage) ### Quick Start **Using Docker Compose (Recommended):** ```bash # 1. Clone repository git clone https://github.com/mirastacklabs-ai/mirador-core cd mirador-core # 2. Start all services make localdev-up # 3. Wait for services to be ready (monitors health checks) make localdev-wait # 4. Verify Mirador Core is running curl http://localhost:8010/api/v1/health # 5. Seed sample KPIs (optional) make localdev-seed-data ``` **Expected Output:** ```json { "status": "healthy", "timestamp": "2026-01-23T10:00:00Z", "services": { "mirador-core": "ok", "victoriametrics": "ok", "victorialogs": "ok", "victoriatraces": "ok", "valkey": "ok" } } ``` ### Access Points Once running, you can access: - **Mirador Core API**: `http://localhost:8010` - **Swagger UI**: `http://localhost:8010/swagger/index.html` - **OpenAPI Spec**: `http://localhost:8010/api/openapi.yaml` - **Health Check**: `http://localhost:8010/health` - **Prometheus Metrics**: `http://localhost:8010/metrics` --- ## Configuration & Deployment ### Configuration File Structure Mirador Core uses a YAML configuration file located at `configs/config.yaml`: ```yaml # Basic Settings environment: production # or development port: 8010 log_level: info # VictoriaMetrics Ecosystem database: victoria_metrics: endpoints: - "http://victoriametrics:8428" timeout: 30000 cluster_mode: false victoria_logs: endpoints: - "http://victorialogs:9428" timeout: 30000 victoria_traces: endpoints: - "http://victoriatraces:10428" timeout: 30000 # Caching (Valkey/Redis) cache: nodes: - "valkey:6379" ttl: 300 # 5 minutes default password: "" # Set via env: CACHE_PASSWORD db: 0 # CORS (for frontend integration) cors: allowed_origins: - "https://your-mirador-ui.com" allowed_methods: - "GET" - "POST" - "PUT" - "DELETE" # Weaviate (optional - for KPI vector storage) weaviate: enabled: true scheme: "http" host: "weaviate" port: 8080 vectorizer: provider: "text2vec-transformers" model: "sentence-transformers/all-MiniLM-L6-v2" use_gpu: false # Engine Configuration engine: # Time window constraints min_window: 1m max_window: 1h # Payload validation strict_time_window_payload: true strict_time_window: true # Correlation settings correlation_threshold: 0.7 default_graph_hops: 3 default_max_whys: 5 # Ring strategy for RCA ring_strategy: "default" # Query limits default_query_limit: 1000 ``` ### Environment Variables Override configuration via environment variables: ```bash # Database credentials export VM_PASSWORD="your-victoriametrics-password" export VL_PASSWORD="your-victorialogs-password" # Cache credentials export CACHE_PASSWORD="your-valkey-password" # Application settings export PORT=8010 export LOG_LEVEL=info export ENVIRONMENT=production # Weaviate connection export WEAVIATE_HOST=weaviate export WEAVIATE_PORT=8080 ``` ### Docker Deployment **Production docker-compose.yml:** ```yaml version: '3.8' services: mirador-core: image: miradorstack/mirador-core:latest container_name: mirador-core ports: - "8010:8010" environment: - ENVIRONMENT=production - LOG_LEVEL=info - VM_ENDPOINT=http://victoriametrics:8428 - VL_ENDPOINT=http://victorialogs:9428 - VT_ENDPOINT=http://victoriatraces:10428 - CACHE_NODES=valkey:6379 - CACHE_PASSWORD=${CACHE_PASSWORD} - WEAVIATE_HOST=weaviate - WEAVIATE_PORT=8080 volumes: - ./configs:/app/configs:ro depends_on: - victoriametrics - victorialogs - victoriatraces - valkey - weaviate restart: unless-stopped networks: - mirador-net healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8010/health"] interval: 30s timeout: 10s retries: 3 start_period: 40s victoriametrics: image: victoriametrics/victoria-metrics:latest container_name: victoriametrics ports: - "8428:8428" volumes: - vmdata:/victoria-metrics-data command: - --storageDataPath=/victoria-metrics-data - --search.maxUniqueTimeseries=2000000 - --memory.allowedPercent=90 networks: - mirador-net restart: unless-stopped victorialogs: image: victoriametrics/victoria-logs:latest container_name: victorialogs ports: - "9428:9428" volumes: - vldata:/victoria-logs-data command: - --storageDataPath=/victoria-logs-data networks: - mirador-net restart: unless-stopped victoriatraces: image: victoriametrics/victoria-traces:latest container_name: victoriatraces ports: - "10428:10428" volumes: - vtdata:/victoria-traces-data command: - --storageDataPath=/victoria-traces-data networks: - mirador-net restart: unless-stopped valkey: image: valkey/valkey:latest container_name: valkey ports: - "6379:6379" command: > valkey-server --requirepass ${CACHE_PASSWORD} --maxmemory 2gb --maxmemory-policy allkeys-lru volumes: - valkeydata:/data networks: - mirador-net restart: unless-stopped weaviate: image: semitechnologies/weaviate:latest container_name: weaviate ports: - "8080:8080" environment: QUERY_DEFAULTS_LIMIT: 25 AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true' PERSISTENCE_DATA_PATH: '/var/lib/weaviate' DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers' ENABLE_MODULES: 'text2vec-transformers' TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080' volumes: - weaviatedata:/var/lib/weaviate networks: - mirador-net restart: unless-stopped t2v-transformers: image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2 container_name: t2v-transformers environment: ENABLE_CUDA: '0' networks: - mirador-net restart: unless-stopped networks: mirador-net: driver: bridge volumes: vmdata: vldata: vtdata: valkeydata: weaviatedata: ``` **Start production environment:** ```bash # Export required environment variables export CACHE_PASSWORD="your-secure-password" # Start services docker-compose up -d # Check logs docker-compose logs -f mirador-core # Verify health curl http://localhost:8010/api/v1/health ``` ### Kubernetes Deployment For Kubernetes deployments, see `deployments/k8s/` directory which includes: - Deployment manifests - Service definitions - ConfigMaps and Secrets - Ingress configuration - Horizontal Pod Autoscaler (HPA) **Quick deploy:** ```bash # Apply all K8s resources kubectl apply -f deployments/k8s/ # Check deployment status kubectl get pods -n mirador # Port-forward for local access kubectl port-forward -n mirador svc/mirador-core 8010:8010 ``` --- ## Component 1: KPI Management ### What are KPIs? **Key Performance Indicators (KPIs)** are the foundation of observability in Mirador Core. They define: - **What metrics to monitor** (e.g., API error rates, latency, transaction volume) - **Where to find the data** (metrics, logs, traces) - **How to query it** (formulas, query objects) - **Business context** (impact layer vs cause layer, sentiment, domain) KPIs enable: - **Registry-driven monitoring**: Central source of truth for all monitored signals - **Correlation discovery**: Automatic relationship detection between metrics - **RCA accuracy**: Better root cause chains with pre-defined impact/cause layers - **Natural language search**: Vector-based semantic search over KPI descriptions ### KPI Structure A KPI definition contains: ```json { "id": "kpi-uuid-123", "name": "api_errors_total", "kind": "tech", "layer": "impact", "signalType": "metrics", "sentiment": "negative", "classifier": "errors", "datastore": "victoriametrics", "queryType": "MetricsQL", "unit": "count", "format": "integer", "formula": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))", "definition": "Total API errors at the gateway per minute", "businessImpact": "Revenue loss due to failed customer transactions", "description": "Tracks API gateway errors - critical service health indicator", "tags": ["api", "errors", "critical"], "dataType": "timeseries", "aggregationWindowHint": "1m", "dimensionsHint": ["service.name", "region"], "refreshInterval": 60, "isShared": true, "userId": "user-uuid-456" } ``` **Required Fields:** - `name`: Unique identifier - `layer`: `impact` (business/user-facing) or `cause` (infrastructure/technical) - `sentiment`: `positive`, `negative`, or `neutral` - `signalType`: `metrics`, `logs`, `traces`, `business`, `synthetic` **Optional But Recommended:** - `formula` or `query`: How to fetch the data - `businessImpact`: Why this metric matters - `tags`: For filtering and organization ### Creating KPIs **Single KPI Creation:** ```bash POST /api/v1/kpi/defs Content-Type: application/json { "kpiDefinition": { "name": "payment_processing_errors", "kind": "business", "layer": "impact", "signalType": "metrics", "sentiment": "negative", "classifier": "errors", "datastore": "victoriametrics", "queryType": "MetricsQL", "unit": "count", "format": "integer", "formula": "sum(rate(payment_errors_total[1m]))", "definition": "Failed payment transactions per minute", "businessImpact": "Direct revenue loss - each error = failed customer transaction", "description": "Critical business KPI tracking payment processing health", "tags": ["payments", "critical", "revenue"], "dataType": "timeseries", "aggregationWindowHint": "1m", "dimensionsHint": ["payment_method", "region"], "serviceFamily": "payments", "domain": "transactions", "refreshInterval": 60, "isShared": true } } ``` **Response (201 Created):** ```json { "status": "created", "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479" } ``` **Using Query Object (Alternative to Formula):** ```json { "kpiDefinition": { "name": "database_latency_p99", "kind": "tech", "layer": "cause", "signalType": "metrics", "sentiment": "negative", "queryType": "MetricsQL", "query": { "metric": "db_query_duration_seconds", "aggregation": "quantile", "quantile": 0.99, "window": "5m" }, "unit": "seconds", "format": "float", "definition": "99th percentile database query latency", "description": "Tracks database performance degradation", "tags": ["database", "latency", "performance"] } } ``` ### Bulk KPI Import **JSON Bulk Import:** ```bash POST /api/v1/kpi/defs/bulk-json Content-Type: application/json { "kpiDefinitions": [ { "name": "api_latency_p95", "layer": "impact", "sentiment": "negative", "signalType": "metrics", "formula": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", "description": "API 95th percentile latency" }, { "name": "kafka_consumer_lag", "layer": "cause", "sentiment": "negative", "signalType": "metrics", "formula": "sum(kafka_consumer_lag_max) by (consumer_group)", "description": "Kafka consumer lag indicating processing delays" } ] } ``` **CSV Bulk Import:** ```bash POST /api/v1/kpi/defs/bulk-csv Content-Type: text/csv name,layer,sentiment,signalType,formula,description cpu_usage_percent,cause,negative,metrics,avg(rate(node_cpu_seconds_total[5m])) * 100,CPU utilization percentage memory_usage_percent,cause,negative,metrics,100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes),Memory utilization percentage disk_io_wait,cause,negative,metrics,rate(node_disk_io_time_seconds_total[5m]),Disk I/O wait time ``` ### Listing and Filtering KPIs **Get all KPIs (paginated):** ```bash GET /api/v1/kpi/defs?limit=10&offset=0 ``` **Filter by layer:** ```bash GET /api/v1/kpi/defs?layer=impact&limit=50 ``` **Filter by tags:** ```bash GET /api/v1/kpi/defs?tags=critical,payments ``` **Filter by multiple criteria:** ```bash GET /api/v1/kpi/defs?layer=cause&sentiment=negative&signalType=metrics&classifier=latency ``` **Response:** ```json { "kpiDefinitions": [ { "id": "kpi-uuid-1", "name": "api_errors_total", "layer": "impact", "sentiment": "negative", "description": "Total API gateway errors" } ], "total": 150, "nextOffset": 10 } ``` ### Searching KPIs (Natural Language) Mirador Core supports vector-based semantic search over KPI descriptions: ```bash POST /api/v1/kpi/search Content-Type: application/json { "query": "payment transaction failures affecting revenue", "limit": 5 } ``` **Response:** ```json { "results": [ { "id": "kpi-uuid-123", "name": "payment_processing_errors", "description": "Failed payment transactions per minute", "score": 0.92 }, { "id": "kpi-uuid-456", "name": "transaction_timeout_total", "description": "Payment transactions timing out", "score": 0.85 } ] } ``` ### Retrieving a Single KPI ```bash GET /api/v1/kpi/defs/{id} ``` **Example:** ```bash GET /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479 ``` **Response:** ```json { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "name": "payment_processing_errors", "kind": "business", "layer": "impact", "signalType": "metrics", "sentiment": "negative", "formula": "sum(rate(payment_errors_total[1m]))", "definition": "Failed payment transactions per minute", "businessImpact": "Direct revenue loss", "description": "Critical business KPI tracking payment processing health", "tags": ["payments", "critical", "revenue"], "createdAt": "2026-01-20T10:00:00Z", "updatedAt": "2026-01-20T10:00:00Z" } ``` ### Updating a KPI Updates use the same endpoint as creation (upsert behavior): ```bash POST /api/v1/kpi/defs Content-Type: application/json { "kpiDefinition": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "name": "payment_processing_errors", "formula": "sum(rate(payment_errors_total[5m]))", # Changed from 1m to 5m "description": "Updated: 5-minute aggregation window" } } ``` **Response (200 OK):** ```json { "status": "ok", "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479" } ``` ### Deleting a KPI ```bash DELETE /api/v1/kpi/defs/{id} ``` **Example:** ```bash DELETE /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479 ``` **Response (200 OK):** ```json { "status": "deleted", "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479" } ``` --- ## Component 2: Failure Detection ### What is Failure Detection? Failure Detection analyzes telemetry data (metrics, logs, traces) within a time window to identify component failures. It: - **Detects error spans**: Traces with `error=true` tag - **Identifies anomalies**: Metrics with `iforest_is_anomaly=true` (from AI anomaly detection) - **Groups by service+component**: Aggregates failures per component - **Generates unique IDs**: Deterministic UUIDs for deduplication - **Persists to storage**: Saves failures to Weaviate for historical analysis ### When to Use Failure Detection - **Incident investigation**: "What failed between 10:00 and 11:00?" - **Post-mortem analysis**: Identify all affected components during an outage - **Pre-RCA preparation**: Gather failure candidates before running RCA - **Monitoring dashboards**: Track failure trends over time ### Detecting Failures **Basic Detection (All Components):** ```bash POST /api/v1/unified/failures/detect Content-Type: application/json { "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" } } ``` **Filtered Detection (Specific Components):** ```bash POST /api/v1/unified/failures/detect Content-Type: application/json { "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "components": ["kafka", "cassandra", "api-gateway"], "services": ["payment-service", "auth-service"] } ``` **Response:** ```json { "incidents": [ { "incident_id": "incident_kafka_1737626400", "failure_id": "kafka-producer-kafka-20260123-100000", "failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2", "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T10:15:00Z" }, "primary_component": "kafka", "affected_transaction_ids": ["txn-123", "txn-456"], "services_involved": ["payment-service", "order-service"], "failure_mode": "error_spans", "confidence": 0.92, "severity": "high" } ], "summary": { "total_incidents": 3, "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "service_component_summaries": [ { "service": "payment-service", "component": "kafka", "failure_id": "kafka-producer-kafka-20260123-100000", "failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2", "failure_count": 42, "affected_transactions": 15, "average_anomaly_score": 0.87, "average_confidence": 0.92, "error_spans_count": 38, "error_metrics_count": 4, "last_failure_timestamp": "2026-01-23T10:14:32Z" } ], "metrics_error_summary": { "total_error_metrics": 12, "total_anomaly_metrics": 8, "error_metrics_by_name": [ { "metric_name": "kafka_producer_errors_total", "count": 8, "average_value": 3.5, "last_timestamp": "2026-01-23T10:14:32Z" } ], "anomaly_metrics_by_name": [ { "metric_name": "kafka_producer_latency_ms", "count": 5, "average_value": 1250.3, "last_timestamp": "2026-01-23T10:14:28Z" } ] } } } ``` ### Understanding Failure IDs Each failure gets two identifiers: 1. **failure_id** (human-readable): `{service}-{component}-{YYYYMMDD-HHMMSS}` - Example: `kafka-producer-kafka-20260123-100000` - Easy to read in logs and dashboards 2. **failure_uuid** (deterministic UUID v5): `e458d90f-f525-58a9-9e92-9f91faa73cf2` - Unique identifier for storage and deduplication - Generated from: service + component + timestamp - Same failure detected multiple times = same UUID ### Transaction Failure Correlation Correlate failures for specific transaction IDs to track cascading failures: ```bash POST /api/v1/unified/failures/correlate Content-Type: application/json { "transactionIDs": ["txn-12345", "txn-67890"], "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" } } ``` **Response:** ```json { "incidents": [ { "incident_id": "incident_txn_12345", "failure_id": "transaction-correlation-20260123-100000", "affected_transaction_ids": ["txn-12345"], "services_involved": ["api-gateway", "payment-service", "kafka"], "failure_sequence": [ { "service": "api-gateway", "component": "http-server", "timestamp": "2026-01-23T10:00:05Z", "error": "timeout waiting for payment-service" }, { "service": "payment-service", "component": "kafka-producer", "timestamp": "2026-01-23T10:00:04Z", "error": "kafka broker unavailable" } ], "root_component": "kafka", "confidence": 0.95 } ] } ``` ### Listing Stored Failures Retrieve paginated list of historical failures: ```bash POST /api/v1/unified/failures/list Content-Type: application/json { "limit": 20, "offset": 0, "filters": { "severity": "high", "component": "kafka" } } ``` **Response:** ```json { "failures": [ { "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2", "failure_id": "kafka-producer-kafka-20260123-100000", "summary": "Kafka producer failures in payment-service", "severity": "high", "timestamp": "2026-01-23T10:00:00Z", "component": "kafka", "service": "payment-service" } ], "total": 150, "nextOffset": 20 } ``` ### Getting Failure Details Retrieve full failure record with all signals and metadata: ```bash POST /api/v1/unified/failures/get Content-Type: application/json { "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2" } ``` **Response:** ```json { "failure": { "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2", "failure_id": "kafka-producer-kafka-20260123-100000", "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T10:15:00Z" }, "primary_component": "kafka", "services": "payment-service", "affected_transaction_count": 15, "error_signals": [ { "timestamp": "2026-01-23T10:00:05Z", "service": "payment-service", "component": "kafka-producer", "error_message": "broker unavailable", "trace_id": "abc123", "span_id": "def456" } ], "anomaly_signals": [ { "timestamp": "2026-01-23T10:00:03Z", "metric_name": "kafka_producer_latency_ms", "value": 1250.3, "is_anomaly": true, "anomaly_score": 0.92 } ], "metadata": { "detection_timestamp": "2026-01-23T10:16:00Z", "detection_confidence": 0.92, "total_signals": 42 } } } ``` ### Deleting a Failure Remove a failure record from storage: ```bash POST /api/v1/unified/failures/delete Content-Type: application/json { "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2" } ``` **Response:** ```json { "status": "deleted", "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2" } ``` --- ## Component 3: Correlation Analysis ### What is Correlation Analysis? Correlation Analysis performs **statistical analysis** between KPIs to discover relationships and patterns. It: - **Builds temporal rings**: Divides time window into rings (R1: immediate, R2: short, R3: medium, R4: long) - **Discovers impact KPIs**: Identifies metrics showing degradation (red anchors) - **Finds candidate causes**: Detects correlated metrics that may explain the impact - **Computes statistics**: Pearson, Spearman, cross-correlation, partial correlation - **Scores candidates**: Assigns suspicion scores based on statistical strength ### Correlation vs RCA | Aspect | Correlation | RCA | |--------|-------------|-----| | **Purpose** | Find statistical relationships | Explain root cause | | **Output** | List of correlated KPIs with scores | 5-WHY chains with narrative | | **Method** | Statistical analysis | Correlation + reasoning | | **Use Case** | "What else changed?" | "Why did it fail?" | **Key Insight**: RCA **uses** Correlation results internally, then adds causal reasoning. ### Running Correlation Analysis **Time-Window Correlation (Recommended):** ```bash POST /api/v1/unified/correlation Content-Type: application/json { "startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T11:00:00Z" } ``` **Response:** ```json { "status": "success", "result": { "correlationID": "corr_1737626400", "timeRange": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "rings": { "R1_IMMEDIATE": { "label": "R1_IMMEDIATE", "description": "Anomalies very close to the peak", "duration": "5s", "start": "2026-01-23T10:55:55Z", "end": "2026-01-23T11:00:00Z" }, "R2_SHORT": { "label": "R2_SHORT", "description": "Anomalies shortly before peak", "duration": "30s", "start": "2026-01-23T10:55:25Z", "end": "2026-01-23T10:55:55Z" }, "R3_MEDIUM": { "label": "R3_MEDIUM", "description": "Anomalies moderately before peak", "duration": "2m", "start": "2026-01-23T10:53:25Z", "end": "2026-01-23T10:55:25Z" }, "R4_LONG": { "label": "R4_LONG", "description": "Anomalies further back", "duration": "10m", "start": "2026-01-23T10:43:25Z", "end": "2026-01-23T10:53:25Z" } }, "affectedServices": ["payment-service", "kafka"], "redAnchors": [ { "service": "payment-service", "metric": "payment_processing_errors", "score": 0.95, "ring": "R1_IMMEDIATE", "labelFingerprint": { "service.name": "payment-service", "region": "us-east-1" } } ], "causes": [ { "kpi": "kafka_producer_latency_ms", "service": "payment-service", "suspicionScore": 0.89, "ring": "R2_SHORT", "reasons": [ "high_pearson_correlation", "high_spearman_correlation", "temporal_precedence", "high_anomaly_density" ], "stats": { "pearson": 0.87, "spearman": 0.91, "crossCorrMax": 0.88, "crossCorrLag": -2, "partial": 0.82 }, "labelFingerprint": { "service.name": "payment-service", "component": "kafka-producer" } }, { "kpi": "kafka_broker_connection_errors", "service": "kafka", "suspicionScore": 0.92, "ring": "R3_MEDIUM", "reasons": [ "high_pearson_correlation", "temporal_precedence", "upstream_component" ], "stats": { "pearson": 0.93, "spearman": 0.89, "crossCorrMax": 0.91, "crossCorrLag": -5, "partial": 0.85 } } ], "confidence": 0.91, "createdAt": "2026-01-23T11:01:00Z" } } ``` ### Understanding Correlation Results **Red Anchors:** - Metrics showing **impact** (business/user-facing degradation) - Typically from KPIs with `layer=impact` - High scores = strong impact signal **Cause Candidates:** - Metrics that **correlate** with red anchors - Typically from KPIs with `layer=cause` - Ranked by suspicion score (higher = more suspicious) **Statistics Explained:** | Metric | Meaning | Range | |--------|---------|-------| | **Pearson** | Linear correlation strength | -1 to +1 | | **Spearman** | Rank correlation (monotonic relationship) | -1 to +1 | | **CrossCorrMax** | Maximum cross-correlation | -1 to +1 | | **CrossCorrLag** | Time lag (negative = cause precedes impact) | seconds | | **Partial** | Correlation after removing confounders | -1 to +1 | **Suspicion Score Calculation:** ``` suspicionScore = weighted_average( pearson_correlation, spearman_correlation, cross_correlation_max, partial_correlation, anomaly_density, temporal_precedence ) ``` **Reasons (Why This Candidate is Suspicious):** - `high_pearson_correlation`: Strong linear relationship - `high_spearman_correlation`: Strong monotonic relationship - `temporal_precedence`: Cause occurred before impact - `high_anomaly_density`: Many anomalies in this metric - `upstream_component`: Component is upstream in service graph ### Time Window Constraints Correlation enforces time window limits from configuration: ```yaml engine: min_window: 1m # Minimum analysis window max_window: 1h # Maximum analysis window ``` **Invalid Windows:** ```bash # Too small {"startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T10:00:30Z"} # Error: time window too small: 30s < minWindow 1m # Too large {"startTime": "2026-01-23T00:00:00Z", "endTime": "2026-01-23T23:59:59Z"} # Error: time window too large: 23h59m59s > maxWindow 1h ``` --- ## Component 4: Root Cause Analysis (RCA) ### What is RCA? Root Cause Analysis (RCA) combines: 1. **Correlation results** (statistical relationships) 2. **5 WHY methodology** (iterative questioning) 3. **Service topology** (upstream/downstream relationships) 4. **Temporal rings** (time-based evidence) To produce **human-readable explanations** of why incidents occurred. ### RCA Process Flow ``` 1. User provides time window ↓ 2. RCA engine calls Correlation engine ↓ 3. Correlation returns: - Red anchors (impacts) - Cause candidates (correlated metrics) - Statistical evidence ↓ 4. RCA builds 5-WHY chains: - WHY 1: Business impact (what failed?) - WHY 2: Entry service degradation - WHY 3-5: Upstream causes (evidence-driven) ↓ 5. Returns structured narrative with: - Impact summary - Causal chains - Time rings - Diagnostic details ``` ### Running RCA **Basic RCA Request:** ```bash POST /api/v1/unified/rca Content-Type: application/json { "startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T11:00:00Z" } ``` **Response:** ```json { "status": "success", "data": { "impact": { "id": "incident_payment_service_1737626400", "impactService": "payment-service", "metricName": "payment_processing_errors", "timeStart": "2026-01-23T10:00:00Z", "timeEnd": "2026-01-23T11:00:00Z", "impactSummary": "Impact detected on payment-service (correlation confidence 0.91). Top-candidate kafka_producer_latency_ms: pearson=0.87 spearman=0.91 partial=0.82 cross_max=0.88 lag=-2 anomalies=HIGH", "severity": 0.91 }, "chains": [ { "steps": [ { "why": 1, "service": "payment-service", "kpiName": "payment_processing_errors", "timeRange": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "ring": "R1_IMMEDIATE", "direction": "SAME", "score": 0.91, "evidence": [ { "type": "red_anchor", "key": "payment-service", "value": "anchor_score=0.950" } ], "summary": "Payment processing failed: payment_processing_errors increased dramatically in R1_IMMEDIATE (0.91 confidence)" }, { "why": 2, "service": "payment-service", "kpiName": "payment_processing_errors", "timeRange": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "ring": "R1_IMMEDIATE", "direction": "UPSTREAM", "score": 0.95, "evidence": [ { "type": "red_anchor", "key": "payment-service", "value": "metric=payment_processing_errors score=0.950" } ], "summary": "payment-service degraded: payment_processing_errors showed anomalies in R1_IMMEDIATE (0.95 confidence)" }, { "why": 3, "service": "payment-service", "kpiName": "kafka_producer_latency_ms", "timeRange": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "ring": "R2_SHORT", "direction": "UPSTREAM", "score": 0.89, "evidence": [ { "type": "correlation_stats", "key": "kafka_producer_latency_ms", "value": "pearson=0.87 spearman=0.91 cross_lag=-2 suspicion=0.89" }, { "type": "correlation_reason", "key": "kafka_producer_latency_ms", "value": "high_pearson_correlation" }, { "type": "correlation_reason", "key": "kafka_producer_latency_ms", "value": "temporal_precedence" } ], "summary": "Upstream component kafka_producer_latency_ms caused payment-service degradation (0.89 suspicion, pearson=0.87)" }, { "why": 4, "service": "kafka", "kpiName": "kafka_broker_connection_errors", "timeRange": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" }, "ring": "R2_SHORT", "direction": "UPSTREAM", "score": 0.92, "evidence": [ { "type": "correlation_stats", "key": "kafka_broker_connection_errors", "value": "pearson=0.93 spearman=0.89 cross_lag=-5 suspicion=0.92" }, { "type": "correlation_reason", "key": "kafka_broker_connection_errors", "value": "high_pearson_correlation" }, { "type": "correlation_reason", "key": "kafka_broker_connection_errors", "value": "upstream_component" } ], "summary": "Upstream component kafka_broker_connection_errors caused payment-service degradation (0.92 suspicion, pearson=0.93)" } ], "score": 0.91, "confidence": 0.91 } ], "generatedAt": "2026-01-23T11:01:30Z", "score": 0.91, "notes": [], "diagnostics": {}, "timeRings": { "definitions": { "R1_IMMEDIATE": { "label": "R1_IMMEDIATE", "description": "Anomalies very close to the peak", "duration": "5s" }, "R2_SHORT": { "label": "R2_SHORT", "description": "Anomalies shortly before peak", "duration": "30s" }, "R3_MEDIUM": { "label": "R3_MEDIUM", "description": "Anomalies moderately before peak", "duration": "2m" }, "R4_LONG": { "label": "R4_LONG", "description": "Anomalies further back", "duration": "10m" } }, "perChain": [] } }, "timestamp": "2026-01-23T11:01:30Z" } ``` ### Understanding RCA Output **Impact Section:** - **impactService**: Which service was affected - **metricName**: What metric degraded - **impactSummary**: Human-readable summary with statistical evidence - **severity**: Confidence score (0-1) **Chains (5-WHY Chains):** Each chain represents one possible root cause path: **Step Structure:** - **why**: Step number (1-5) - **service**: Service involved at this step - **kpiName**: Metric/KPI at this step - **ring**: Temporal ring (when it occurred) - **direction**: `SAME` (impact), `UPSTREAM` (cause), `DOWNSTREAM` (effect) - **score**: Confidence for this step - **evidence**: Statistical and correlation evidence - **summary**: Human-readable explanation **Chain Scoring:** - Weighted average: earlier steps (WHY 1-2) weighted higher - Higher score = more confident root cause path - Multiple chains = multiple possible root causes (sorted by score) **Time Rings:** - **R1_IMMEDIATE** (5s): Events very close to peak - **R2_SHORT** (30s): Events shortly before peak - **R3_MEDIUM** (2m): Events moderately before peak - **R4_LONG** (10m): Events further back Rings help identify temporal ordering (cause precedes effect). ### Interpreting a 5-WHY Chain Example chain interpretation: ``` WHY 1 (Business Impact): "Payment processing failed" → What the user experienced → Business/revenue impact WHY 2 (Entry Service): "payment-service degraded" → Which service exhibited the problem → Where the impact manifested WHY 3 (Direct Cause): "kafka_producer_latency_ms increased" → Immediate technical cause → Component directly affecting service WHY 4 (Upstream Cause): "kafka_broker_connection_errors occurred" → Root infrastructure issue → What actually triggered the cascade WHY 5 (Optional): Further upstream causes if available ``` ### Low Confidence RCA If no correlation data is available: **Request:** ```json { "startTime": "2026-01-01T00:00:00Z", "endTime": "2026-01-01T00:05:00Z" } ``` **Response:** ```json { "status": "success", "data": { "impact": { "id": "incident_unknown", "impactService": "unknown", "metricName": "unknown", "impactSummary": "No correlation data for window 2026-01-01 00:00:00 +0000 UTC - 2026-01-01 00:05:00 +0000 UTC", "severity": 0 }, "chains": [], "score": 0, "notes": ["Correlation produced no candidates; returning low-confidence RCA"] } } ``` This indicates: - No KPIs were found with degradation in this window - Or KPI registry is empty - Or VictoriaMetrics/Logs/Traces have no data for this period **Resolution:** 1. Verify KPIs are defined 2. Check time window contains actual incidents 3. Ensure telemetry data exists in VictoriaMetrics/Logs/Traces --- ## Complete Workflows ### Workflow 1: Full Incident Investigation **Scenario**: Payment processing outage on Jan 23, 2026 between 10:00-11:00 **Step 1: Define KPIs (if not already done)** ```bash # Define impact KPI POST /api/v1/kpi/defs { "kpiDefinition": { "name": "payment_processing_errors", "layer": "impact", "sentiment": "negative", "signalType": "metrics", "formula": "sum(rate(payment_errors_total[1m]))", "description": "Failed payment transactions" } } # Define cause KPIs POST /api/v1/kpi/defs { "kpiDefinition": { "name": "kafka_producer_latency_ms", "layer": "cause", "sentiment": "negative", "signalType": "metrics", "formula": "histogram_quantile(0.99, kafka_producer_latency_bucket)", "description": "Kafka producer p99 latency" } } POST /api/v1/kpi/defs { "kpiDefinition": { "name": "database_connection_pool_exhausted", "layer": "cause", "sentiment": "negative", "signalType": "metrics", "formula": "db_pool_active / db_pool_max > 0.95", "description": "Database connection pool near capacity" } } ``` **Step 2: Detect Failures** ```bash POST /api/v1/unified/failures/detect { "time_range": { "start": "2026-01-23T10:00:00Z", "end": "2026-01-23T11:00:00Z" } } ``` **Result**: Identified failures in: - payment-service (kafka component) - database-service (connection pool) - api-gateway (timeouts) **Step 3: Run Correlation Analysis** ```bash POST /api/v1/unified/correlation { "startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T11:00:00Z" } ``` **Result**: Correlation found: - payment_processing_errors (impact) - kafka_producer_latency_ms (cause, suspicion=0.89) - database_connection_pool_exhausted (cause, suspicion=0.75) **Step 4: Run RCA** ```bash POST /api/v1/unified/rca { "startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T11:00:00Z" } ``` **Result**: RCA chain: 1. WHY 1: Payment processing failed (business impact) 2. WHY 2: payment-service degraded (errors increased) 3. WHY 3: kafka_producer_latency_ms spiked (direct cause) 4. WHY 4: kafka_broker_connection_errors occurred (root cause) **Conclusion**: Kafka broker connection failures caused producer latency, leading to payment processing errors. ### Workflow 2: Proactive Monitoring Setup **Scenario**: Set up monitoring for a new microservice **Step 1: Bulk Import KPIs** ```bash # Create kpis.csv name,layer,sentiment,signalType,formula,description user_service_api_latency_p99,impact,negative,metrics,histogram_quantile(0.99\\, rate(http_request_duration_seconds_bucket{service=\"user-service\"}[5m])),API latency affecting users user_service_error_rate,impact,negative,metrics,sum(rate(http_requests_total{service=\"user-service\"\\,status=~\"5..\"}[1m])) / sum(rate(http_requests_total{service=\"user-service\"}[1m])),Error rate impacting reliability user_service_cpu_usage,cause,negative,metrics,avg(rate(process_cpu_seconds_total{service=\"user-service\"}[5m])) * 100,CPU utilization user_service_memory_usage,cause,negative,metrics,process_resident_memory_bytes{service=\"user-service\"} / 1024 / 1024,Memory usage in MB user_service_db_query_latency,cause,negative,metrics,histogram_quantile(0.95\\, rate(db_query_duration_seconds_bucket{service=\"user-service\"}[5m])),Database query latency # Import POST /api/v1/kpi/defs/bulk-csv < kpis.csv ``` **Step 2: Verify KPIs** ```bash GET /api/v1/kpi/defs?tags=user-service&limit=10 ``` **Step 3: Test Correlation** ```bash # Run correlation for last hour POST /api/v1/unified/correlation { "startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T11:00:00Z" } ``` **Step 4: Schedule Periodic RCA** Set up a cron job or monitoring system to: ```bash # Run RCA every 15 minutes for the last 15 minutes */15 * * * * curl -X POST http://mirador-core:8010/api/v1/unified/rca \ -H "Content-Type: application/json" \ -d "{\"startTime\":\"$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)\",\"endTime\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" ``` ### Workflow 3: Historical Incident Analysis **Scenario**: Analyze pattern of failures over past week **Step 1: List Failures** ```bash POST /api/v1/unified/failures/list { "limit": 100, "offset": 0, "filters": { "start_time": "2026-01-16T00:00:00Z", "end_time": "2026-01-23T00:00:00Z", "severity": "high" } } ``` **Step 2: Analyze Each Failure** ```bash # For each failure_uuid from step 1 POST /api/v1/unified/failures/get { "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2" } ``` **Step 3: Run RCA for Each Incident** ```bash # For each incident time window POST /api/v1/unified/rca { "startTime": "", "endTime": "" } ``` **Step 4: Aggregate Patterns** Analyze RCA results to find: - Common root causes (e.g., kafka broker issues appearing in 80% of incidents) - Frequently affected services (e.g., payment-service) - Temporal patterns (e.g., failures cluster around 10 AM) --- ## API Reference ### Base URL ``` http://localhost:8010 ``` ### Authentication Mirador Core is designed to run behind an external API gateway or service mesh that handles authentication. No built-in authentication is required for API calls. ### Content Type All requests must include: ``` Content-Type: application/json ``` ### KPI Endpoints | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/kpi/defs` | List KPI definitions | | POST | `/api/v1/kpi/defs` | Create/update KPI definition | | GET | `/api/v1/kpi/defs/{id}` | Get single KPI definition | | DELETE | `/api/v1/kpi/defs/{id}` | Delete KPI definition | | POST | `/api/v1/kpi/defs/bulk-json` | Bulk import KPIs (JSON) | | POST | `/api/v1/kpi/defs/bulk-csv` | Bulk import KPIs (CSV) | | POST | `/api/v1/kpi/search` | Semantic search for KPIs | ### Failure Detection Endpoints | Method | Endpoint | Description | |--------|----------|-------------| | POST | `/api/v1/unified/failures/detect` | Detect component failures | | POST | `/api/v1/unified/failures/correlate` | Correlate transaction failures | | POST | `/api/v1/unified/failures/list` | List stored failures | | POST | `/api/v1/unified/failures/get` | Get failure details | | POST | `/api/v1/unified/failures/delete` | Delete failure record | ### Correlation Endpoints | Method | Endpoint | Description | |--------|----------|-------------| | POST | `/api/v1/unified/correlation` | Run correlation analysis | ### RCA Endpoints | Method | Endpoint | Description | |--------|----------|-------------| | POST | `/api/v1/unified/rca` | Compute root cause analysis | ### Internal Endpoints | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/health` | Health check | | GET | `/ready` | Readiness check | | GET | `/metrics` | Prometheus metrics | | GET | `/api/v1/health` | API v1 health | | GET | `/api/openapi.yaml` | OpenAPI spec (YAML) | | GET | `/api/openapi.json` | OpenAPI spec (JSON) | | GET | `/swagger/index.html` | Swagger UI | ### Query Parameters **KPI List Filtering:** - `limit` (int): Max results (default: 10, max: 10000) - `offset` (int): Pagination offset (default: 0) - `tags` (string[]): Filter by tags (comma-separated) - `layer` (string): Filter by layer (impact/cause) - `sentiment` (string): Filter by sentiment (positive/negative/neutral) - `signalType` (string): Filter by signal type (metrics/logs/traces/business/synthetic) - `kind` (string): Filter by kind (business/tech) - `classifier` (string): Filter by classifier - `datastore` (string): Filter by datastore **Example:** ``` GET /api/v1/kpi/defs?layer=impact&sentiment=negative&limit=50&tags=critical,payments ``` --- ## Troubleshooting ### Problem: KPIs Not Being Detected by Correlation **Symptoms:** - Correlation returns empty results - RCA shows "No correlation data" **Diagnosis:** ```bash # 1. Check if KPIs exist GET /api/v1/kpi/defs?limit=10 # 2. Check if KPIs have formulas GET /api/v1/kpi/defs/{id} # Verify "formula" or "query" field is populated # 3. Test formula manually against VictoriaMetrics curl "http://victoriametrics:8428/api/v1/query?query=&time=$(date +%s)" ``` **Solutions:** 1. Ensure KPIs have valid `formula` or `query` fields 2. Verify VictoriaMetrics contains data for the formula 3. Check time window overlaps with actual data availability ### Problem: Failure Detection Returns Empty Results **Symptoms:** - `/unified/failures/detect` returns `"total_incidents": 0` **Diagnosis:** ```bash # 1. Check if traces exist curl "http://victoriatraces:10428/api/v1/search?start=&end=&tags={}" # 2. Check if metrics exist curl "http://victoriametrics:8428/api/v1/query_range?query=up&start=&end=&step=60" # 3. Verify time window format # Must be RFC3339 UTC: "2026-01-23T10:00:00Z" ``` **Solutions:** 1. Ensure traces are being ingested to VictoriaTraces 2. Verify error spans have `error=true` tag 3. Check anomaly metrics have `iforest_is_anomaly=true` label 4. Confirm time window contains actual incident data ### Problem: RCA Returns Low Confidence **Symptoms:** - RCA score is 0 or very low - Chains are empty - Notes contain: "Correlation produced no candidates" **Diagnosis:** ```bash # 1. Run correlation first to see what it finds POST /api/v1/unified/correlation { "startTime": "...", "endTime": "..." } # 2. Check correlation response # If empty, diagnose correlation (see above) # 3. Verify KPIs have layer=impact and layer=cause defined GET /api/v1/kpi/defs?layer=impact GET /api/v1/kpi/defs?layer=cause ``` **Solutions:** 1. Define at least one `layer=impact` KPI 2. Define multiple `layer=cause` KPIs 3. Ensure time window contains degradation events 4. Run failure detection first to confirm incidents exist ### Problem: Time Window Validation Errors **Symptoms:** - `400 Bad Request: time window too small` - `413 Payload Too Large: time window too large` **Diagnosis:** ```bash # Check engine configuration curl http://localhost:8010/api/v1/unified/metadata | jq '.engineConfig' ``` **Solutions:** 1. Adjust window to respect `min_window` and `max_window` 2. Default constraints: - `min_window: 1m` - `max_window: 1h` 3. Update `configs/config.yaml` if constraints are too restrictive ### Problem: Weaviate Connection Failures **Symptoms:** - KPI creation fails with "weaviate unavailable" - Failure detection works but failures aren't persisted **Diagnosis:** ```bash # 1. Check Weaviate health curl http://weaviate:8080/v1/.well-known/ready # 2. Check Mirador Core logs docker logs mirador-core | grep -i weaviate # 3. Verify Weaviate is enabled in config cat configs/config.yaml | grep -A5 weaviate ``` **Solutions:** 1. Ensure Weaviate container is running: `docker ps | grep weaviate` 2. Verify network connectivity: `docker network inspect mirador-net` 3. Set `weaviate.enabled: true` in config 4. Restart Mirador Core: `docker restart mirador-core` ### Problem: High Memory Usage **Symptoms:** - Mirador Core container OOM killed - Slow response times **Diagnosis:** ```bash # Check container memory docker stats mirador-core # Check VictoriaMetrics data volume docker exec victoriametrics du -sh /victoria-metrics-data ``` **Solutions:** 1. Increase container memory limit in docker-compose.yml: ```yaml deploy: resources: limits: memory: 4G ``` 2. Reduce `default_query_limit` in config.yaml 3. Enable more aggressive caching (increase TTL) 4. Reduce `max_window` to limit analysis scope ### Problem: Correlation Takes Too Long **Symptoms:** - Correlation requests timeout - High CPU usage during correlation **Diagnosis:** ```bash # Check number of KPIs GET /api/v1/kpi/defs | jq '.total' # Check time window size # Large windows = more data to analyze ``` **Solutions:** 1. Reduce time window (use 15m instead of 1h) 2. Reduce number of active KPIs (archive unused ones) 3. Increase timeout in config: ```yaml database: victoria_metrics: timeout: 60000 # 60 seconds ``` 4. Scale Mirador Core horizontally (add more replicas) ### Getting Help **Logs:** ```bash # Mirador Core logs docker logs mirador-core --tail 100 -f # All services logs docker-compose logs -f ``` **Health Checks:** ```bash # Overall health curl http://localhost:8010/api/v1/health | jq # Service status curl http://localhost:8010/microservices/status | jq ``` **Metrics:** ```bash # Prometheus metrics curl http://localhost:8010/metrics | grep mirador ``` **Support:** - GitHub Issues: https://github.com/mirastacklabs-ai/mirador-core/issues - Documentation: http://localhost:8010/swagger/index.html --- ## Summary This guide covered the four core components of Mirador Core: 1. **KPIs**: Define and manage metrics (foundation) 2. **Failures**: Detect and track incidents 3. **Correlation**: Analyze statistical relationships 4. **RCA**: Explain root causes with 5-WHY methodology **Key Takeaways:** - Always define KPIs before running failure detection or RCA - Use impact (`layer=impact`) and cause (`layer=cause`) KPIs for best results - Correlation provides statistical evidence; RCA adds causal reasoning - Time windows must respect configured min/max constraints - All components are interconnected and build upon each other For complete API documentation, see the [Swagger UI](http://localhost:8010/swagger/index.html).