KPI, Failures, Correlation & RCA User Guide

Version: 10.0.0
Last Updated: January 2026
Target Audience: API Consumers & Integration Engineers

Table of Contents

Overview
Prerequisites & Setup
Configuration & Deployment
Component 1: KPI Management
Component 2: Failure Detection
Component 3: Correlation Analysis
Component 4: Root Cause Analysis (RCA)
Complete Workflows
API Reference
Troubleshooting

Overview

What This Guide Covers

This guide explains how to use Mirador Core’s four interconnected observability components:

┌──────────┐     ┌──────────┐     ┌─────────────┐     ┌──────┐
│   KPIs   │ --> │ Failures │ --> │ Correlation │ --> │ RCA  │
└──────────┘     └──────────┘     └─────────────┘     └──────┘
   Define           Detect          Analyze           Explain
   Metrics          Incidents       Patterns          Root Cause

Dependency Chain

Each component builds upon the previous:

KPIs (Key Performance Indicators): Define which metrics to monitor
Failures: Detect incidents based on KPI anomalies and error signals
Correlation: Perform statistical analysis to find relationships between KPIs
RCA (Root Cause Analysis): Use correlation data and the 5 WHY methodology to identify root causes

Important: Without KPIs defined, Failure Detection will have limited effectiveness. Without Failures and KPIs, RCA cannot function.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Mirador Core API                          │
│                  (localhost:8010)                           │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
    ┌───▼────┐      ┌──────▼───────┐     ┌────▼─────┐
    │  KPI   │      │  Correlation │     │   RCA    │
    │  Repo  │      │   Engine     │     │  Engine  │
    └───┬────┘      └──────┬───────┘     └────┬─────┘
        │                  │                   │
        │          ┌───────▼────────┐          │
        │          │ Failure        │          │
        └──────────► Detection      ◄──────────┘
                   └────────┬───────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐      ┌──────▼───────┐    ┌─────▼─────┐
   │Victoria │      │ Victoria     │    │ Victoria  │
   │ Metrics │      │ Logs         │    │ Traces    │
   └─────────┘      └──────────────┘    └───────────┘

Prerequisites & Setup

System Requirements

Mirador Core:

Go 1.21+ (for building from source)
Docker & Docker Compose (for containerized deployment)
4GB RAM minimum (8GB recommended)
20GB disk space

Required Backend Services:

VictoriaMetrics (metrics storage)
VictoriaLogs (logs storage)
VictoriaTraces (traces storage)
Valkey/Redis (caching)
Weaviate (optional, for KPI vector storage)

Quick Start

Using Docker Compose (Recommended):

# 1. Clone repository
git clone https://github.com/mirastacklabs-ai/mirador-core
cd mirador-core

# 2. Start all services
make localdev-up

# 3. Wait for services to be ready (monitors health checks)
make localdev-wait

# 4. Verify Mirador Core is running
curl http://localhost:8010/api/v1/health

# 5. Seed sample KPIs (optional)
make localdev-seed-data

Expected Output:

{
  "status": "healthy",
  "timestamp": "2026-01-23T10:00:00Z",
  "services": {
    "mirador-core": "ok",
    "victoriametrics": "ok",
    "victorialogs": "ok",
    "victoriatraces": "ok",
    "valkey": "ok"
  }
}
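
A client script can gate later steps on this payload. The Python sketch below assumes the response shape shown above; the helper name unhealthy_services and the "down" state value are illustrative and not part of Mirador Core.

```python
import json

def unhealthy_services(health: dict) -> list[str]:
    """Return the names of services whose state is not "ok"."""
    services = health.get("services", {})
    return sorted(name for name, state in services.items() if state != "ok")

# Payload shaped like the expected output above, with one backend marked
# "down" purely for illustration (the real set of state values is not
# documented in this guide).
sample = json.loads("""
{
  "services": {
    "mirador-core": "ok",
    "victoriametrics": "ok",
    "valkey": "down"
  }
}
""")

print(unhealthy_services(sample))  # ['valkey']
```

An empty list means every listed service reported "ok", so it is safe to continue with seeding or API calls.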

Access Points

Once running, you can access:

Mirador Core API: http://localhost:8010
Swagger UI: http://localhost:8010/swagger/index.html
OpenAPI Spec: http://localhost:8010/api/openapi.yaml
Health Check: http://localhost:8010/health
Prometheus Metrics: http://localhost:8010/metrics

Configuration & Deployment

Configuration File Structure

Mirador Core uses a YAML configuration file located at configs/config.yaml:

# Basic Settings
environment: production  # or development
port: 8010
log_level: info

# VictoriaMetrics Ecosystem
database:
  victoria_metrics:
    endpoints:
      - "http://victoriametrics:8428"
    timeout: 30000
    cluster_mode: false
    
  victoria_logs:
    endpoints:
      - "http://victorialogs:9428"
    timeout: 30000
    
  victoria_traces:
    endpoints:
      - "http://victoriatraces:10428"
    timeout: 30000

# Caching (Valkey/Redis)
cache:
  nodes:
    - "valkey:6379"
  ttl: 300  # 5 minutes default
  password: ""  # Set via env: CACHE_PASSWORD
  db: 0

# CORS (for frontend integration)
cors:
  allowed_origins:
    - "https://your-mirador-ui.com"
  allowed_methods:
    - "GET"
    - "POST"
    - "PUT"
    - "DELETE"

# Weaviate (optional - for KPI vector storage)
weaviate:
  enabled: true
  scheme: "http"
  host: "weaviate"
  port: 8080
  vectorizer:
    provider: "text2vec-transformers"
    model: "sentence-transformers/all-MiniLM-L6-v2"
    use_gpu: false

# Engine Configuration
engine:
  # Time window constraints
  min_window: 1m
  max_window: 1h
  
  # Payload validation
  strict_time_window_payload: true
  strict_time_window: true
  
  # Correlation settings
  correlation_threshold: 0.7
  default_graph_hops: 3
  default_max_whys: 5
  
  # Ring strategy for RCA
  ring_strategy: "default"
  
  # Query limits
  default_query_limit: 1000

Environment Variables

Override configuration via environment variables:

# Database credentials
export VM_PASSWORD="your-victoriametrics-password"
export VL_PASSWORD="your-victorialogs-password"

# Cache credentials
export CACHE_PASSWORD="your-valkey-password"

# Application settings
export PORT=8010
export LOG_LEVEL=info
export ENVIRONMENT=production

# Weaviate connection
export WEAVIATE_HOST=weaviate
export WEAVIATE_PORT=8080

Docker Deployment

Production docker-compose.yml:

version: '3.8'

services:
  mirador-core:
    image: miradorstack/mirador-core:latest
    container_name: mirador-core
    ports:
      - "8010:8010"
    environment:
      - ENVIRONMENT=production
      - LOG_LEVEL=info
      - VM_ENDPOINT=http://victoriametrics:8428
      - VL_ENDPOINT=http://victorialogs:9428
      - VT_ENDPOINT=http://victoriatraces:10428
      - CACHE_NODES=valkey:6379
      - CACHE_PASSWORD=${CACHE_PASSWORD}
      - WEAVIATE_HOST=weaviate
      - WEAVIATE_PORT=8080
    volumes:
      - ./configs:/app/configs:ro
    depends_on:
      - victoriametrics
      - victorialogs
      - victoriatraces
      - valkey
      - weaviate
    restart: unless-stopped
    networks:
      - mirador-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8010/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    container_name: victoriametrics
    ports:
      - "8428:8428"
    volumes:
      - vmdata:/victoria-metrics-data
    command:
      - --storageDataPath=/victoria-metrics-data
      - --search.maxUniqueTimeseries=2000000
      - --memory.allowedPercent=90
    networks:
      - mirador-net
    restart: unless-stopped

  victorialogs:
    image: victoriametrics/victoria-logs:latest
    container_name: victorialogs
    ports:
      - "9428:9428"
    volumes:
      - vldata:/victoria-logs-data
    command:
      - --storageDataPath=/victoria-logs-data
    networks:
      - mirador-net
    restart: unless-stopped

  victoriatraces:
    image: victoriametrics/victoria-traces:latest
    container_name: victoriatraces
    ports:
      - "10428:10428"
    volumes:
      - vtdata:/victoria-traces-data
    command:
      - --storageDataPath=/victoria-traces-data
    networks:
      - mirador-net
    restart: unless-stopped

  valkey:
    image: valkey/valkey:latest
    container_name: valkey
    ports:
      - "6379:6379"
    command: >
      valkey-server
      --requirepass ${CACHE_PASSWORD}
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
    volumes:
      - valkeydata:/data
    networks:
      - mirador-net
    restart: unless-stopped

  weaviate:
    image: semitechnologies/weaviate:latest
    container_name: weaviate
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
    volumes:
      - weaviatedata:/var/lib/weaviate
    networks:
      - mirador-net
    restart: unless-stopped

  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    container_name: t2v-transformers
    environment:
      ENABLE_CUDA: '0'
    networks:
      - mirador-net
    restart: unless-stopped

networks:
  mirador-net:
    driver: bridge

volumes:
  vmdata:
  vldata:
  vtdata:
  valkeydata:
  weaviatedata:

Start production environment:

# Export required environment variables
export CACHE_PASSWORD="your-secure-password"

# Start services
docker-compose up -d

# Check logs
docker-compose logs -f mirador-core

# Verify health
curl http://localhost:8010/api/v1/health

Kubernetes Deployment

For Kubernetes deployments, see the deployments/k8s/ directory, which includes:

Deployment manifests
Service definitions
ConfigMaps and Secrets
Ingress configuration
Horizontal Pod Autoscaler (HPA)

Quick deploy:

# Apply all K8s resources
kubectl apply -f deployments/k8s/

# Check deployment status
kubectl get pods -n mirador

# Port-forward for local access
kubectl port-forward -n mirador svc/mirador-core 8010:8010

Component 1: KPI Management

What are KPIs?

Key Performance Indicators (KPIs) are the foundation of observability in Mirador Core. They define:

What metrics to monitor (e.g., API error rates, latency, transaction volume)
Where to find the data (metrics, logs, traces)
How to query it (formulas, query objects)
Business context (impact layer vs cause layer, sentiment, domain)

KPIs enable:

Registry-driven monitoring: Central source of truth for all monitored signals
Correlation discovery: Automatic relationship detection between metrics
RCA accuracy: Better root cause chains with pre-defined impact/cause layers
Natural language search: Vector-based semantic search over KPI descriptions

KPI Structure

A KPI definition contains:

{
  "id": "kpi-uuid-123",
  "name": "api_errors_total",
  "kind": "tech",
  "layer": "impact",
  "signalType": "metrics",
  "sentiment": "negative",
  "classifier": "errors",
  "datastore": "victoriametrics",
  "queryType": "MetricsQL",
  "unit": "count",
  "format": "integer",
  "formula": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
  "definition": "Total API errors at the gateway per minute",
  "businessImpact": "Revenue loss due to failed customer transactions",
  "description": "Tracks API gateway errors - critical service health indicator",
  "tags": ["api", "errors", "critical"],
  "dataType": "timeseries",
  "aggregationWindowHint": "1m",
  "dimensionsHint": ["service.name", "region"],
  "refreshInterval": 60,
  "isShared": true,
  "userId": "user-uuid-456"
}

Required Fields:

name: Unique identifier
layer: impact (business/user-facing) or cause (infrastructure/technical)
sentiment: positive, negative, or neutral
signalType: metrics, logs, traces, business, synthetic

Optional But Recommended:

formula or query: How to fetch the data
businessImpact: Why this metric matters
tags: For filtering and organization
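
The required fields and their allowed values can be checked on the client before a definition is submitted. This is a minimal sketch based only on the field list above; validate_kpi is a hypothetical helper, and the server may apply additional rules that this guide does not describe.

```python
REQUIRED_FIELDS = ("name", "layer", "sentiment", "signalType")

# Allowed values as listed under "Required Fields" above.
ALLOWED_VALUES = {
    "layer": {"impact", "cause"},
    "sentiment": {"positive", "negative", "neutral"},
    "signalType": {"metrics", "logs", "traces", "business", "synthetic"},
}

def validate_kpi(kpi: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not kpi.get(field):
            problems.append(f"missing required field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        value = kpi.get(field)
        if value and value not in allowed:
            problems.append(f"{field}={value!r} is not one of {sorted(allowed)}")
    return problems

print(validate_kpi({"name": "api_errors_total", "layer": "middle", "sentiment": "negative"}))
```

The example prints two problems: signalType is missing and "middle" is not a valid layer.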

Creating KPIs

Single KPI Creation:

POST /api/v1/kpi/defs
Content-Type: application/json

{
  "kpiDefinition": {
    "name": "payment_processing_errors",
    "kind": "business",
    "layer": "impact",
    "signalType": "metrics",
    "sentiment": "negative",
    "classifier": "errors",
    "datastore": "victoriametrics",
    "queryType": "MetricsQL",
    "unit": "count",
    "format": "integer",
    "formula": "sum(rate(payment_errors_total[1m]))",
    "definition": "Failed payment transactions per minute",
    "businessImpact": "Direct revenue loss - each error = failed customer transaction",
    "description": "Critical business KPI tracking payment processing health",
    "tags": ["payments", "critical", "revenue"],
    "dataType": "timeseries",
    "aggregationWindowHint": "1m",
    "dimensionsHint": ["payment_method", "region"],
    "serviceFamily": "payments",
    "domain": "transactions",
    "refreshInterval": 60,
    "isShared": true
  }
}

Response (201 Created):

{
  "status": "created",
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}
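
From Python, the same request can be assembled with the standard library. The sketch below only builds the request object and does not send it; BASE_URL assumes the local development address used in this guide, build_create_kpi_request is a hypothetical helper, and any authentication your deployment requires is not shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8010"  # assumption: local dev address from Quick Start

def build_create_kpi_request(kpi: dict) -> urllib.request.Request:
    """Wrap a KPI in the kpiDefinition envelope and build the POST request."""
    body = json.dumps({"kpiDefinition": kpi}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/kpi/defs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_create_kpi_request({
    "name": "payment_processing_errors",
    "layer": "impact",
    "signalType": "metrics",
    "sentiment": "negative",
    "formula": "sum(rate(payment_errors_total[1m]))",
})

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       created = json.load(resp)
print(req.get_method(), req.full_url)
```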

Using Query Object (Alternative to Formula):

{
  "kpiDefinition": {
    "name": "database_latency_p99",
    "kind": "tech",
    "layer": "cause",
    "signalType": "metrics",
    "sentiment": "negative",
    "queryType": "MetricsQL",
    "query": {
      "metric": "db_query_duration_seconds",
      "aggregation": "quantile",
      "quantile": 0.99,
      "window": "5m"
    },
    "unit": "seconds",
    "format": "float",
    "definition": "99th percentile database query latency",
    "description": "Tracks database performance degradation",
    "tags": ["database", "latency", "performance"]
  }
}

Bulk KPI Import

JSON Bulk Import:

POST /api/v1/kpi/defs/bulk-json
Content-Type: application/json

{
  "kpiDefinitions": [
    {
      "name": "api_latency_p95",
      "layer": "impact",
      "sentiment": "negative",
      "signalType": "metrics",
      "formula": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
      "description": "API 95th percentile latency"
    },
    {
      "name": "kafka_consumer_lag",
      "layer": "cause",
      "sentiment": "negative",
      "signalType": "metrics",
      "formula": "sum(kafka_consumer_lag_max) by (consumer_group)",
      "description": "Kafka consumer lag indicating processing delays"
    }
  ]
}

CSV Bulk Import:

POST /api/v1/kpi/defs/bulk-csv
Content-Type: text/csv

name,layer,sentiment,signalType,formula,description
cpu_usage_percent,cause,negative,metrics,"100 * (1 - avg(rate(node_cpu_seconds_total{mode=""idle""}[5m])))",CPU utilization percentage
memory_usage_percent,cause,negative,metrics,100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes),Memory utilization percentage
disk_io_wait,cause,negative,metrics,rate(node_disk_io_time_seconds_total[5m]),Disk I/O wait time
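
Formulas often contain commas or double quotes (for example histogram_quantile(0.95, ...)), which would split a hand-written CSV row into extra columns. A safer approach is to generate the file with a CSV library that quotes such fields. The sketch below assumes the importer accepts standard quoted CSV fields, which this guide does not confirm; kpis_to_csv is a hypothetical helper.

```python
import csv
import io

COLUMNS = ["name", "layer", "sentiment", "signalType", "formula", "description"]

def kpis_to_csv(kpis: list[dict]) -> str:
    """Serialize KPI rows to CSV, quoting fields that contain commas or quotes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kpis)
    return buf.getvalue()

rows = [{
    "name": "api_latency_p95",
    "layer": "impact",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
    "description": "API 95th percentile latency",
}]

csv_text = kpis_to_csv(rows)
print(csv_text)
```

The formula column is emitted inside double quotes, so the comma after 0.95 stays part of the formula when the file is parsed.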

Listing and Filtering KPIs

Get all KPIs (paginated):

GET /api/v1/kpi/defs?limit=10&offset=0

Filter by layer:

GET /api/v1/kpi/defs?layer=impact&limit=50

Filter by tags:

GET /api/v1/kpi/defs?tags=critical,payments

Filter by multiple criteria:

GET /api/v1/kpi/defs?layer=cause&sentiment=negative&signalType=metrics&classifier=latency

Response:

{
  "kpiDefinitions": [
    {
      "id": "kpi-uuid-1",
      "name": "api_errors_total",
      "layer": "impact",
      "sentiment": "negative",
      "description": "Total API gateway errors"
    }
  ],
  "total": 150,
  "nextOffset": 10
}
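
To read the full registry, a client can follow nextOffset until the listing is exhausted. The sketch below takes the page fetcher as a parameter so it runs without a server. The stop conditions (empty page, missing nextOffset, or nextOffset at or beyond total) are assumptions, since this guide does not show what the last page looks like.

```python
from typing import Callable, Iterator

def iter_kpis(fetch_page: Callable[[int, int], dict], limit: int = 10) -> Iterator[dict]:
    """Yield every KPI by following nextOffset across pages."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page.get("kpiDefinitions", [])
        yield from items
        next_offset = page.get("nextOffset")
        total = page.get("total")
        # Assumed stop conditions: nothing returned, or no offset that advances.
        if not items or next_offset is None or next_offset <= offset:
            return
        if total is not None and next_offset >= total:
            return
        offset = next_offset

def fake_fetch(limit: int, offset: int) -> dict:
    """Stand-in for GET /api/v1/kpi/defs?limit=..&offset=.. over 25 fake KPIs."""
    data = [{"name": f"kpi_{i}"} for i in range(25)]
    chunk = data[offset:offset + limit]
    return {"kpiDefinitions": chunk, "total": len(data), "nextOffset": offset + len(chunk)}

names = [kpi["name"] for kpi in iter_kpis(fake_fetch, limit=10)]
print(len(names))  # 25
```

In real use, fetch_page would issue the GET request shown above and return the decoded JSON body.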

Searching KPIs (Natural Language)

Mirador Core supports vector-based semantic search over KPI descriptions:

POST /api/v1/kpi/search
Content-Type: application/json

{
  "query": "payment transaction failures affecting revenue",
  "limit": 5
}

Response:

{
  "results": [
    {
      "id": "kpi-uuid-123",
      "name": "payment_processing_errors",
      "description": "Failed payment transactions per minute",
      "score": 0.92
    },
    {
      "id": "kpi-uuid-456",
      "name": "transaction_timeout_total",
      "description": "Payment transactions timing out",
      "score": 0.85
    }
  ]
}

Retrieving a Single KPI

GET /api/v1/kpi/defs/{id}

Example:

GET /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479

Response:

{
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "name": "payment_processing_errors",
  "kind": "business",
  "layer": "impact",
  "signalType": "metrics",
  "sentiment": "negative",
  "formula": "sum(rate(payment_errors_total[1m]))",
  "definition": "Failed payment transactions per minute",
  "businessImpact": "Direct revenue loss",
  "description": "Critical business KPI tracking payment processing health",
  "tags": ["payments", "critical", "revenue"],
  "createdAt": "2026-01-20T10:00:00Z",
  "updatedAt": "2026-01-20T10:00:00Z"
}

Updating a KPI

Updates use the same endpoint as creation (upsert behavior). The example below changes the rate window in the formula from 1m to 5m:

POST /api/v1/kpi/defs
Content-Type: application/json

{
  "kpiDefinition": {
    "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "name": "payment_processing_errors",
    "formula": "sum(rate(payment_errors_total[5m]))",
    "description": "Updated: 5-minute aggregation window"
  }
}

Response (200 OK):

{
  "status": "ok",
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

Deleting a KPI

DELETE /api/v1/kpi/defs/{id}

Example:

DELETE /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479

Response (200 OK):

{
  "status": "deleted",
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

Component 2: Failure Detection

What is Failure Detection?

Failure Detection analyzes telemetry data (metrics, logs, traces) within a time window to identify component failures. It:

Detects error spans: Traces with error=true tag
Identifies anomalies: Metrics with iforest_is_anomaly=true (from AI anomaly detection)
Groups by service+component: Aggregates failures per component
Generates unique IDs: Deterministic UUIDs for deduplication
Persists to storage: Saves failures to Weaviate for historical analysis

When to Use Failure Detection

Incident investigation: “What failed between 10:00 and 11:00?”
Post-mortem analysis: Identify all affected components during an outage
Pre-RCA preparation: Gather failure candidates before running RCA
Monitoring dashboards: Track failure trends over time

Detecting Failures

Basic Detection (All Components):

POST /api/v1/unified/failures/detect
Content-Type: application/json

{
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  }
}

Filtered Detection (Specific Components):

POST /api/v1/unified/failures/detect
Content-Type: application/json

{
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  },
  "components": ["kafka", "cassandra", "api-gateway"],
  "services": ["payment-service", "auth-service"]
}

Response:

{
  "incidents": [
    {
      "incident_id": "incident_kafka_1737626400",
      "failure_id": "kafka-producer-kafka-20260123-100000",
      "failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
      "time_range": {
        "start": "2026-01-23T10:00:00Z",
        "end": "2026-01-23T10:15:00Z"
      },
      "primary_component": "kafka",
      "affected_transaction_ids": ["txn-123", "txn-456"],
      "services_involved": ["payment-service", "order-service"],
      "failure_mode": "error_spans",
      "confidence": 0.92,
      "severity": "high"
    }
  ],
  "summary": {
    "total_incidents": 3,
    "time_range": {
      "start": "2026-01-23T10:00:00Z",
      "end": "2026-01-23T11:00:00Z"
    },
    "service_component_summaries": [
      {
        "service": "payment-service",
        "component": "kafka",
        "failure_id": "kafka-producer-kafka-20260123-100000",
        "failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
        "failure_count": 42,
        "affected_transactions": 15,
        "average_anomaly_score": 0.87,
        "average_confidence": 0.92,
        "error_spans_count": 38,
        "error_metrics_count": 4,
        "last_failure_timestamp": "2026-01-23T10:14:32Z"
      }
    ],
    "metrics_error_summary": {
      "total_error_metrics": 12,
      "total_anomaly_metrics": 8,
      "error_metrics_by_name": [
        {
          "metric_name": "kafka_producer_errors_total",
          "count": 8,
          "average_value": 3.5,
          "last_timestamp": "2026-01-23T10:14:32Z"
        }
      ],
      "anomaly_metrics_by_name": [
        {
          "metric_name": "kafka_producer_latency_ms",
          "count": 5,
          "average_value": 1250.3,
          "last_timestamp": "2026-01-23T10:14:28Z"
        }
      ]
    }
  }
}

Understanding Failure IDs

Each failure gets two identifiers:

failure_id (human-readable): {service}-{component}-{YYYYMMDD-HHMMSS}
- Example: kafka-producer-kafka-20260123-100000
- Easy to read in logs and dashboards
failure_uuid (deterministic UUID v5): e458d90f-f525-58a9-9e92-9f91faa73cf2
- Unique identifier for storage and deduplication
- Generated from: service + component + timestamp
- Same failure detected multiple times = same UUID
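
The deduplication property follows from how UUID v5 works: the same namespace and name always hash to the same UUID. The sketch below illustrates this with Python's uuid module. The namespace and the exact name string are hypothetical (the guide only says the UUID is derived from service, component, and timestamp), so the values produced here will not match the UUIDs Mirador Core generates.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical namespace: the one Mirador Core uses is not documented here.
FAILURE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "failures.mirador.example")

def make_failure_ids(service: str, component: str, detected_at: datetime):
    """Return (failure_id, failure_uuid) for one detected failure."""
    failure_id = f"{service}-{component}-{detected_at:%Y%m%d-%H%M%S}"
    # UUID v5 is a SHA-1 hash of namespace + name, so it is fully deterministic.
    failure_uuid = uuid.uuid5(FAILURE_NAMESPACE, failure_id)
    return failure_id, failure_uuid

ts = datetime(2026, 1, 23, 10, 0, 0, tzinfo=timezone.utc)
first = make_failure_ids("kafka-producer", "kafka", ts)
second = make_failure_ids("kafka-producer", "kafka", ts)

print(first[0])          # kafka-producer-kafka-20260123-100000
print(first == second)   # True
print(first[1].version)  # 5
```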

Transaction Failure Correlation

Correlate failures for specific transaction IDs to track cascading failures:

POST /api/v1/unified/failures/correlate
Content-Type: application/json

{
  "transactionIDs": ["txn-12345", "txn-67890"],
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  }
}

Response:

{
  "incidents": [
    {
      "incident_id": "incident_txn_12345",
      "failure_id": "transaction-correlation-20260123-100000",
      "affected_transaction_ids": ["txn-12345"],
      "services_involved": ["api-gateway", "payment-service", "kafka"],
      "failure_sequence": [
        {
          "service": "api-gateway",
          "component": "http-server",
          "timestamp": "2026-01-23T10:00:05Z",
          "error": "timeout waiting for payment-service"
        },
        {
          "service": "payment-service",
          "component": "kafka-producer",
          "timestamp": "2026-01-23T10:00:04Z",
          "error": "kafka broker unavailable"
        }
      ],
      "root_component": "kafka",
      "confidence": 0.95
    }
  ]
}
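
The entries in failure_sequence are not necessarily in time order (in the sample above, the payment-service error at 10:00:04 is listed after the api-gateway timeout at 10:00:05). A client that wants to display the cascade from earliest to latest can sort by timestamp. This is a small sketch over the response shape shown above; chronological is a hypothetical helper.

```python
from datetime import datetime

def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; "Z" is normalized for Python < 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def chronological(sequence: list[dict]) -> list[dict]:
    """Return the failure steps ordered from earliest to latest."""
    return sorted(sequence, key=lambda step: parse_ts(step["timestamp"]))

failure_sequence = [
    {"service": "api-gateway", "component": "http-server",
     "timestamp": "2026-01-23T10:00:05Z"},
    {"service": "payment-service", "component": "kafka-producer",
     "timestamp": "2026-01-23T10:00:04Z"},
]

ordered = chronological(failure_sequence)
print(ordered[0]["service"])  # payment-service
```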

Listing Stored Failures

Retrieve a paginated list of historical failures:

POST /api/v1/unified/failures/list
Content-Type: application/json

{
  "limit": 20,
  "offset": 0,
  "filters": {
    "severity": "high",
    "component": "kafka"
  }
}

Response:

{
  "failures": [
    {
      "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
      "failure_id": "kafka-producer-kafka-20260123-100000",
      "summary": "Kafka producer failures in payment-service",
      "severity": "high",
      "timestamp": "2026-01-23T10:00:00Z",
      "component": "kafka",
      "service": "payment-service"
    }
  ],
  "total": 150,
  "nextOffset": 20
}

Getting Failure Details

Retrieve full failure record with all signals and metadata:

POST /api/v1/unified/failures/get
Content-Type: application/json

{
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Response:

{
  "failure": {
    "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
    "failure_id": "kafka-producer-kafka-20260123-100000",
    "time_range": {
      "start": "2026-01-23T10:00:00Z",
      "end": "2026-01-23T10:15:00Z"
    },
    "primary_component": "kafka",
    "services": "payment-service",
    "affected_transaction_count": 15,
    "error_signals": [
      {
        "timestamp": "2026-01-23T10:00:05Z",
        "service": "payment-service",
        "component": "kafka-producer",
        "error_message": "broker unavailable",
        "trace_id": "abc123",
        "span_id": "def456"
      }
    ],
    "anomaly_signals": [
      {
        "timestamp": "2026-01-23T10:00:03Z",
        "metric_name": "kafka_producer_latency_ms",
        "value": 1250.3,
        "is_anomaly": true,
        "anomaly_score": 0.92
      }
    ],
    "metadata": {
      "detection_timestamp": "2026-01-23T10:16:00Z",
      "detection_confidence": 0.92,
      "total_signals": 42
    }
  }
}

Deleting a Failure

Remove a failure record from storage:

POST /api/v1/unified/failures/delete
Content-Type: application/json

{
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Response:

{
  "status": "deleted",
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Component 3: Correlation Analysis

What is Correlation Analysis?

Correlation Analysis performs statistical analysis between KPIs to discover relationships and patterns. It:

Builds temporal rings: Divides time window into rings (R1: immediate, R2: short, R3: medium, R4: long)
Discovers impact KPIs: Identifies metrics showing degradation (red anchors)
Finds candidate causes: Detects correlated metrics that may explain the impact
Computes statistics: Pearson, Spearman, cross-correlation, partial correlation
Scores candidates: Assigns suspicion scores based on statistical strength
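
One way to picture the temporal rings is as consecutive windows that step backwards from the incident peak. The sketch below is an illustration, not the engine's implementation: the ring durations (5s, 30s, 2m, 10m) are taken from the sample response later in this section, and anchoring R1 so that it ends at the peak is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Durations as they appear in the sample correlation response (assumed defaults).
RING_DURATIONS = [
    ("R1_IMMEDIATE", timedelta(seconds=5)),
    ("R2_SHORT", timedelta(seconds=30)),
    ("R3_MEDIUM", timedelta(minutes=2)),
    ("R4_LONG", timedelta(minutes=10)),
]

def build_rings(peak: datetime) -> dict:
    """Return {label: (start, end)} with each ring ending where the previous starts."""
    rings = {}
    end = peak
    for label, duration in RING_DURATIONS:
        start = end - duration
        rings[label] = (start, end)
        end = start  # the next (older) ring ends where this one begins
    return rings

peak = datetime(2026, 1, 23, 10, 56, 0, tzinfo=timezone.utc)
for label, (start, end) in build_rings(peak).items():
    print(label, start.isoformat(), end.isoformat())
```

Rings closer to the peak are shorter, so evidence found there is more tightly bound in time to the impact.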

Correlation vs RCA

Aspect	Correlation	RCA
Purpose	Find statistical relationships	Explain root cause
Output	List of correlated KPIs with scores	5-WHY chains with narrative
Method	Statistical analysis	Correlation + reasoning
Use Case	“What else changed?”	“Why did it fail?”

Key Insight: RCA uses Correlation results internally, then adds causal reasoning.

Running Correlation Analysis

Time-Window Correlation (Recommended):

POST /api/v1/unified/correlation
Content-Type: application/json

{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Response:

{
  "status": "success",
  "result": {
    "correlationID": "corr_1737626400",
    "timeRange": {
      "start": "2026-01-23T10:00:00Z",
      "end": "2026-01-23T11:00:00Z"
    },
    "rings": {
      "R1_IMMEDIATE": {
        "label": "R1_IMMEDIATE",
        "description": "Anomalies very close to the peak",
        "duration": "5s",
        "start": "2026-01-23T10:55:55Z",
        "end": "2026-01-23T10:56:00Z"
      },
      "R2_SHORT": {
        "label": "R2_SHORT",
        "description": "Anomalies shortly before peak",
        "duration": "30s",
        "start": "2026-01-23T10:55:25Z",
        "end": "2026-01-23T10:55:55Z"
      },
      "R3_MEDIUM": {
        "label": "R3_MEDIUM",
        "description": "Anomalies moderately before peak",
        "duration": "2m",
        "start": "2026-01-23T10:53:25Z",
        "end": "2026-01-23T10:55:25Z"
      },
      "R4_LONG": {
        "label": "R4_LONG",
        "description": "Anomalies further back",
        "duration": "10m",
        "start": "2026-01-23T10:43:25Z",
        "end": "2026-01-23T10:53:25Z"
      }
    },
    "affectedServices": ["payment-service", "kafka"],
    "redAnchors": [
      {
        "service": "payment-service",
        "metric": "payment_processing_errors",
        "score": 0.95,
        "ring": "R1_IMMEDIATE",
        "labelFingerprint": {
          "service.name": "payment-service",
          "region": "us-east-1"
        }
      }
    ],
    "causes": [
      {
        "kpi": "kafka_producer_latency_ms",
        "service": "payment-service",
        "suspicionScore": 0.89,
        "ring": "R2_SHORT",
        "reasons": [
          "high_pearson_correlation",
          "high_spearman_correlation",
          "temporal_precedence",
          "high_anomaly_density"
        ],
        "stats": {
          "pearson": 0.87,
          "spearman": 0.91,
          "crossCorrMax": 0.88,
          "crossCorrLag": -2,
          "partial": 0.82
        },
        "labelFingerprint": {
          "service.name": "payment-service",
          "component": "kafka-producer"
        }
      },
      {
        "kpi": "kafka_broker_connection_errors",
        "service": "kafka",
        "suspicionScore": 0.92,
        "ring": "R3_MEDIUM",
        "reasons": [
          "high_pearson_correlation",
          "temporal_precedence",
          "upstream_component"
        ],
        "stats": {
          "pearson": 0.93,
          "spearman": 0.89,
          "crossCorrMax": 0.91,
          "crossCorrLag": -5,
          "partial": 0.85
        }
      }
    ],
    "confidence": 0.91,
    "createdAt": "2026-01-23T11:01:00Z"
  }
}

Understanding Correlation Results

Red Anchors:

Metrics showing impact (business/user-facing degradation)
Typically from KPIs with layer=impact
High scores = strong impact signal

Cause Candidates:

Metrics that correlate with red anchors
Typically from KPIs with layer=cause
Ranked by suspicion score (higher = more suspicious)
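
A client can post-process the causes array to keep only the strongest candidates. The sketch below assumes the response shape shown above. Reusing 0.7 (the engine.correlation_threshold value from the configuration section) as a client-side cutoff is an assumption for illustration, and top_causes is a hypothetical helper.

```python
def top_causes(result: dict, threshold: float = 0.7, limit: int = 5) -> list[dict]:
    """Return causes at or above the threshold, most suspicious first."""
    causes = [
        cause for cause in result.get("causes", [])
        if cause.get("suspicionScore", 0.0) >= threshold
    ]
    causes.sort(key=lambda cause: cause["suspicionScore"], reverse=True)
    return causes[:limit]

def precedes_impact(cause: dict) -> bool:
    """A negative crossCorrLag means the cause moved before the impact."""
    return cause.get("stats", {}).get("crossCorrLag", 0) < 0

# Trimmed copy of the sample response above.
result = {
    "causes": [
        {"kpi": "kafka_producer_latency_ms", "suspicionScore": 0.89,
         "stats": {"crossCorrLag": -2}},
        {"kpi": "kafka_broker_connection_errors", "suspicionScore": 0.92,
         "stats": {"crossCorrLag": -5}},
    ]
}

for cause in top_causes(result):
    print(cause["kpi"], cause["suspicionScore"], precedes_impact(cause))
```

With the sample data, kafka_broker_connection_errors (0.92) is listed before kafka_producer_latency_ms (0.89), and both precede the impact.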

Statistics Explained:

Metric	Meaning	Range
Pearson	Linear correlation strength	-1 to +1
Spearman	Rank correlation (monotonic relationship)	-1 to +1
CrossCorrMax	Maximum cross-correlation	-1 to +1
CrossCorrLag	Time lag (negative = cause precedes impact)	seconds
Partial	Correlation after removing confounders	-1 to +1
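
To make these statistics concrete, the sketch below computes Pearson, Spearman, and the best cross-correlation lag for two short series in plain Python. It is a teaching example, not the engine's code: the lag here is measured in samples rather than seconds, partial correlation is omitted, and all function names are illustrative.

```python
from math import sqrt

def pearson(x: list, y: list) -> float:
    """Linear correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def ranks(values: list) -> list:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x: list, y: list) -> float:
    """Rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

def best_lag(cause: list, impact: list, max_lag: int):
    """Return (lag, r) maximizing r; a negative lag means the cause leads."""
    best = (0, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            a, b = cause[:lag], impact[-lag:]   # cause[t] vs impact[t + |lag|]
        elif lag > 0:
            a, b = cause[lag:], impact[:-lag]   # cause[t + lag] vs impact[t]
        else:
            a, b = cause, impact
        if len(a) >= 3:
            r = pearson(a, b)
            if r > best[1]:
                best = (lag, r)
    return best

cause = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0, 0, 0]
impact = [0, 0, 0, 0, 1, 5, 9, 5, 1, 0, 0, 0]  # same shape, two samples later
lag, r = best_lag(cause, impact, max_lag=4)
print(lag)  # -2
```

Here the impact series is the cause series delayed by two samples, so the best alignment is at lag -2, matching the "negative = cause precedes impact" convention in the table.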

Suspicion Score Calculation:

suspicionScore = weighted_average(
  pearson_correlation,
  spearman_correlation,
  cross_correlation_max,
  partial_correlation,
  anomaly_density,
  temporal_precedence
)
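
The weights behind this average are not published in this guide. The sketch below shows the general shape of such a calculation with hypothetical weights, so it is not expected to reproduce the suspicionScore values in the sample response.

```python
# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "pearson": 0.2,
    "spearman": 0.2,
    "crossCorrMax": 0.2,
    "partial": 0.2,
    "anomalyDensity": 0.1,
    "temporalPrecedence": 0.1,
}

def suspicion_score(signals: dict) -> float:
    """Weighted average of the signals that are present, as a value in [0, 1]."""
    used = {name: weight for name, weight in WEIGHTS.items() if name in signals}
    if not used:
        return 0.0
    total_weight = sum(used.values())
    # Correlations may be negative; their magnitude is what indicates strength.
    weighted = sum(abs(signals[name]) * weight for name, weight in used.items())
    return round(min(weighted / total_weight, 1.0), 3)

stats = {"pearson": 0.87, "spearman": 0.91, "crossCorrMax": 0.88, "partial": 0.82}
print(suspicion_score(stats))
```

Normalizing by the weights of the signals actually present keeps the score comparable when a statistic is missing for a candidate.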

Reasons (Why This Candidate is Suspicious):

high_pearson_correlation: Strong linear relationship
high_spearman_correlation: Strong monotonic relationship
temporal_precedence: Cause occurred before impact
high_anomaly_density: Many anomalies in this metric
upstream_component: Component is upstream in service graph

Time Window Constraints

Correlation enforces time window limits from configuration:

engine:
  min_window: 1m   # Minimum analysis window
  max_window: 1h   # Maximum analysis window

Invalid Windows:

# Too small
{"startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T10:00:30Z"}
# Error: time window too small: 30s < minWindow 1m

# Too large
{"startTime": "2026-01-23T00:00:00Z", "endTime": "2026-01-23T23:59:59Z"}
# Error: time window too large: 23h59m59s > maxWindow 1h
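
Clients can run the same check before calling the API to avoid a round trip. The sketch below hard-codes the min_window and max_window values shown above; in a real deployment they come from the server configuration, and window_problem is a hypothetical helper whose messages do not mirror the server's error text.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_WINDOW = timedelta(minutes=1)  # engine.min_window in the sample config
MAX_WINDOW = timedelta(hours=1)    # engine.max_window in the sample config

def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; "Z" is normalized for Python < 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def window_problem(start_time: str, end_time: str) -> Optional[str]:
    """Return a description of the problem, or None if the window is acceptable."""
    window = parse_ts(end_time) - parse_ts(start_time)
    if window <= timedelta(0):
        return "endTime must be after startTime"
    if window < MIN_WINDOW:
        return f"window {window} is shorter than min_window {MIN_WINDOW}"
    if window > MAX_WINDOW:
        return f"window {window} is longer than max_window {MAX_WINDOW}"
    return None

print(window_problem("2026-01-23T10:00:00Z", "2026-01-23T10:00:30Z"))
print(window_problem("2026-01-23T10:00:00Z", "2026-01-23T11:00:00Z"))  # None
```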

Component 4: Root Cause Analysis (RCA)

What is RCA?

Root Cause Analysis (RCA) produces human-readable explanations of why incidents occurred. It combines:

Correlation results (statistical relationships)
5 WHY methodology (iterative questioning)
Service topology (upstream/downstream relationships)
Temporal rings (time-based evidence)

RCA Process Flow

1. User provides time window
        ↓
2. RCA engine calls Correlation engine
        ↓
3. Correlation returns:
   - Red anchors (impacts)
   - Cause candidates (correlated metrics)
   - Statistical evidence
        ↓
4. RCA builds 5-WHY chains:
   - WHY 1: Business impact (what failed?)
   - WHY 2: Entry service degradation
   - WHY 3-5: Upstream causes (evidence-driven)
        ↓
5. Returns structured narrative with:
   - Impact summary
   - Causal chains
   - Time rings
   - Diagnostic details
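
Step 4 of the flow above can be sketched as a function that turns a correlation result into an ordered list of WHY steps. This is a simplified illustration and not the RCA engine's algorithm: ordering causes by ring (nearest to the peak first) is an assumption, the step fields mirror the sample responses in this guide, and max_whys follows the default_max_whys setting from the configuration section.

```python
RING_ORDER = ["R1_IMMEDIATE", "R2_SHORT", "R3_MEDIUM", "R4_LONG"]

def build_why_chain(correlation: dict, max_whys: int = 5) -> list[dict]:
    """Build WHY 1-2 from the strongest red anchor, then WHY 3+ from causes."""
    anchors = correlation.get("redAnchors", [])
    if not anchors:
        return []
    anchor = max(anchors, key=lambda a: a["score"])
    base = {"service": anchor["service"], "kpiName": anchor["metric"],
            "ring": anchor["ring"], "score": anchor["score"]}
    steps = [
        {"why": 1, "direction": "SAME", **base},      # what failed?
        {"why": 2, "direction": "UPSTREAM", **base},  # entry service degradation
    ]
    # Assumed ordering: nearest ring first, then higher suspicion first.
    causes = sorted(
        correlation.get("causes", []),
        key=lambda c: (RING_ORDER.index(c["ring"]), -c["suspicionScore"]),
    )
    for cause in causes[: max(max_whys - 2, 0)]:
        steps.append({
            "why": len(steps) + 1,
            "direction": "UPSTREAM",
            "service": cause["service"],
            "kpiName": cause["kpi"],
            "ring": cause["ring"],
            "score": cause["suspicionScore"],
        })
    return steps

correlation = {
    "redAnchors": [{"service": "payment-service", "metric": "payment_processing_errors",
                    "score": 0.95, "ring": "R1_IMMEDIATE"}],
    "causes": [
        {"kpi": "kafka_broker_connection_errors", "service": "kafka",
         "suspicionScore": 0.92, "ring": "R3_MEDIUM"},
        {"kpi": "kafka_producer_latency_ms", "service": "payment-service",
         "suspicionScore": 0.89, "ring": "R2_SHORT"},
    ],
}

for step in build_why_chain(correlation):
    print(step["why"], step["kpiName"], step["ring"])
```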

Running RCA

Basic RCA Request:

POST /api/v1/unified/rca
Content-Type: application/json

{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Response:

{
  "status": "success",
  "data": {
    "impact": {
      "id": "incident_payment_service_1737626400",
      "impactService": "payment-service",
      "metricName": "payment_processing_errors",
      "timeStart": "2026-01-23T10:00:00Z",
      "timeEnd": "2026-01-23T11:00:00Z",
      "impactSummary": "Impact detected on payment-service (correlation confidence 0.91). Top-candidate kafka_producer_latency_ms: pearson=0.87 spearman=0.91 partial=0.82 cross_max=0.88 lag=-2 anomalies=HIGH",
      "severity": 0.91
    },
    "chains": [
      {
        "steps": [
          {
            "why": 1,
            "service": "payment-service",
            "kpiName": "payment_processing_errors",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R1_IMMEDIATE",
            "direction": "SAME",
            "score": 0.91,
            "evidence": [
              {
                "type": "red_anchor",
                "key": "payment-service",
                "value": "anchor_score=0.950"
              }
            ],
            "summary": "Payment processing failed: payment_processing_errors increased dramatically in R1_IMMEDIATE (0.91 confidence)"
          },
          {
            "why": 2,
            "service": "payment-service",
            "kpiName": "payment_processing_errors",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R1_IMMEDIATE",
            "direction": "UPSTREAM",
            "score": 0.95,
            "evidence": [
              {
                "type": "red_anchor",
                "key": "payment-service",
                "value": "metric=payment_processing_errors score=0.950"
              }
            ],
            "summary": "payment-service degraded: payment_processing_errors showed anomalies in R1_IMMEDIATE (0.95 confidence)"
          },
          {
            "why": 3,
            "service": "payment-service",
            "kpiName": "kafka_producer_latency_ms",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R2_SHORT",
            "direction": "UPSTREAM",
            "score": 0.89,
            "evidence": [
              {
                "type": "correlation_stats",
                "key": "kafka_producer_latency_ms",
                "value": "pearson=0.87 spearman=0.91 cross_lag=-2 suspicion=0.89"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_producer_latency_ms",
                "value": "high_pearson_correlation"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_producer_latency_ms",
                "value": "temporal_precedence"
              }
            ],
            "summary": "Upstream component kafka_producer_latency_ms caused payment-service degradation (0.89 suspicion, pearson=0.87)"
          },
          {
            "why": 4,
            "service": "kafka",
            "kpiName": "kafka_broker_connection_errors",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R2_SHORT",
            "direction": "UPSTREAM",
            "score": 0.92,
            "evidence": [
              {
                "type": "correlation_stats",
                "key": "kafka_broker_connection_errors",
                "value": "pearson=0.93 spearman=0.89 cross_lag=-5 suspicion=0.92"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_broker_connection_errors",
                "value": "high_pearson_correlation"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_broker_connection_errors",
                "value": "upstream_component"
              }
            ],
            "summary": "Upstream component kafka_broker_connection_errors caused payment-service degradation (0.92 suspicion, pearson=0.93)"
          }
        ],
        "score": 0.91,
        "confidence": 0.91
      }
    ],
    "generatedAt": "2026-01-23T11:01:30Z",
    "score": 0.91,
    "notes": [],
    "diagnostics": {},
    "timeRings": {
      "definitions": {
        "R1_IMMEDIATE": {
          "label": "R1_IMMEDIATE",
          "description": "Anomalies very close to the peak",
          "duration": "5s"
        },
        "R2_SHORT": {
          "label": "R2_SHORT",
          "description": "Anomalies shortly before peak",
          "duration": "30s"
        },
        "R3_MEDIUM": {
          "label": "R3_MEDIUM",
          "description": "Anomalies moderately before peak",
          "duration": "2m"
        },
        "R4_LONG": {
          "label": "R4_LONG",
          "description": "Anomalies further back",
          "duration": "10m"
        }
      },
      "perChain": []
    }
  },
  "timestamp": "2026-01-23T11:01:30Z"
}
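
As a rough illustration, the basic RCA request above could be issued from Python using only the standard library. The helper names `to_rfc3339`, `build_rca_request`, and `run_rca` are hypothetical, and the base URL assumes the local deployment shown in the architecture diagram.

```python
import json
import urllib.request
from datetime import datetime, timezone

BASE_URL = "http://localhost:8010"  # assumption: local deployment


def to_rfc3339(ts: datetime) -> str:
    """Format a timezone-aware datetime as RFC3339 UTC, e.g. 2026-01-23T10:00:00Z."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_rca_request(start: datetime, end: datetime) -> dict:
    """Build the JSON body for POST /api/v1/unified/rca."""
    if end <= start:
        raise ValueError("endTime must be after startTime")
    return {"startTime": to_rfc3339(start), "endTime": to_rfc3339(end)}


def run_rca(start: datetime, end: datetime, base_url: str = BASE_URL) -> dict:
    """Send the RCA request and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/unified/rca",
        data=json.dumps(build_rca_request(start, end)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Keeping the payload builder separate from the network call makes the request body easy to inspect or log before anything is sent.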

Understanding RCA Output

Impact Section:

impactService: Which service was affected
metricName: What metric degraded
impactSummary: Human-readable summary with statistical evidence
severity: Score from 0 to 1; in the example above it matches the correlation confidence (0.91)

Chains (5-WHY Chains):

Each chain represents one possible root cause path:

Step Structure:

why: Step number (1-5)
service: Service involved at this step
kpiName: Metric/KPI at this step
ring: Temporal ring (when it occurred)
direction: SAME (impact), UPSTREAM (cause), DOWNSTREAM (effect)
score: Confidence for this step
evidence: Statistical and correlation evidence
summary: Human-readable explanation
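
The evidence values in the example response are space-separated `key=value` strings. The sketch below turns such a string into a dictionary for easier filtering. The format is inferred from the example only and is not a documented contract, and `parse_evidence_value` is a hypothetical helper name.

```python
def parse_evidence_value(value: str) -> dict:
    """Split 'k=v k=v' evidence strings into a dict; numeric values become floats."""
    parsed = {}
    for token in value.split():
        key, sep, raw = token.partition("=")
        if not sep:
            # Bare tokens such as "high_pearson_correlation" carry no key; skip them.
            continue
        try:
            parsed[key] = float(raw)
        except ValueError:
            parsed[key] = raw
    return parsed
```

For a `correlation_stats` entry this yields numeric fields such as `pearson` and `suspicion`, while `correlation_reason` values (plain labels) produce an empty dictionary and are best read as-is.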

Chain Scoring:

The chain score is a weighted average of the step scores, with earlier steps (WHY 1-2) weighted higher
Higher score = more confident root cause path
Multiple chains = multiple possible root causes (sorted by score)
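
The guide states only that earlier steps carry more weight; the actual weights are not documented here. The sketch below uses hypothetical linearly decreasing weights purely to illustrate the idea, so its results will not necessarily match the scores Mirador Core returns.

```python
def chain_score(step_scores: list[float]) -> float:
    """Weighted average where step i (0-based) gets weight n - i.

    Assumption: linearly decreasing weights; the real weighting is not documented.
    """
    n = len(step_scores)
    if n == 0:
        return 0.0
    weights = [n - i for i in range(n)]
    total = sum(w * s for w, s in zip(weights, step_scores))
    return total / sum(weights)
```

With this scheme a strong WHY 1 followed by a weak WHY 2 scores higher than the reverse, which is the property the description above implies.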

Time Rings:

R1_IMMEDIATE (5s): Events very close to peak
R2_SHORT (30s): Events shortly before peak
R3_MEDIUM (2m): Events moderately before peak
R4_LONG (10m): Events further back

Rings help identify temporal ordering (cause precedes effect).
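
One way to picture the rings is as nested lead-time buckets before the peak. The sketch below assumes each duration is an inclusive upper bound on how long before the peak an anomaly occurred, and that the narrowest matching ring wins; the guide does not specify the boundary behavior, so treat this as an illustration only.

```python
from datetime import timedelta

# Ring durations taken from the timeRings.definitions block in the example response.
RINGS = [
    ("R1_IMMEDIATE", timedelta(seconds=5)),
    ("R2_SHORT", timedelta(seconds=30)),
    ("R3_MEDIUM", timedelta(minutes=2)),
    ("R4_LONG", timedelta(minutes=10)),
]


def ring_for_lead_time(lead: timedelta):
    """Return the narrowest ring for an anomaly that preceded the peak by `lead`.

    Assumption: each duration is an inclusive upper bound on the lead time.
    Returns None for a negative lead or one longer than the widest ring.
    """
    if lead < timedelta(0):
        return None
    for name, limit in RINGS:
        if lead <= limit:
            return name
    return None
```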

Interpreting a 5-WHY Chain

Example chain interpretation:

WHY 1 (Business Impact): "Payment processing failed"
  → What the user experienced
  → Business/revenue impact

WHY 2 (Entry Service): "payment-service degraded"
  → Which service exhibited the problem
  → Where the impact manifested

WHY 3 (Direct Cause): "kafka_producer_latency_ms increased"
  → Immediate technical cause
  → Component directly affecting service

WHY 4 (Upstream Cause): "kafka_broker_connection_errors occurred"
  → Root infrastructure issue
  → What actually triggered the cascade

WHY 5 (Optional): Further upstream causes if available
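
To read a chain quickly, it can help to flatten each step into one line. The sketch below does that for a chain object shaped like the example response; `describe_chain` is a hypothetical helper, and it assumes every step carries the fields listed under Step Structure.

```python
def describe_chain(chain: dict) -> list[str]:
    """Return one summary line per step, ordered by WHY number."""
    lines = []
    for step in sorted(chain.get("steps", []), key=lambda s: s["why"]):
        lines.append(
            f"WHY {step['why']} [{step['ring']}, {step['direction']}] "
            f"{step['service']}/{step['kpiName']} (score {step['score']:.2f})"
        )
    return lines
```

Reading the lines top to bottom walks from the observed impact toward the suspected root cause.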

Low Confidence RCA

If no correlation data is available:

Request:

{
  "startTime": "2026-01-01T00:00:00Z",
  "endTime": "2026-01-01T00:05:00Z"
}

Response:

{
  "status": "success",
  "data": {
    "impact": {
      "id": "incident_unknown",
      "impactService": "unknown",
      "metricName": "unknown",
      "impactSummary": "No correlation data for window 2026-01-01 00:00:00 +0000 UTC - 2026-01-01 00:05:00 +0000 UTC",
      "severity": 0
    },
    "chains": [],
    "score": 0,
    "notes": ["Correlation produced no candidates; returning low-confidence RCA"]
  }
}

This indicates one of the following:

No KPIs showed degradation in this window
The KPI registry is empty
VictoriaMetrics/Logs/Traces have no data for this period

Resolution:

Verify KPIs are defined
Check time window contains actual incidents
Ensure telemetry data exists in VictoriaMetrics/Logs/Traces
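
Because a low-confidence RCA still returns "status": "success", callers may want an explicit check before acting on the result. The sketch below treats a response as low confidence when it has no chains or a zero score, based on the example above; `is_low_confidence` is a hypothetical helper and the rule is an assumption, not a documented threshold.

```python
def is_low_confidence(response: dict) -> bool:
    """True when the RCA response has no chains or a zero overall score."""
    data = response.get("data", {})
    no_chains = not data.get("chains")
    zero_score = data.get("score", 0) == 0
    return no_chains or zero_score


def low_confidence_notes(response: dict) -> list[str]:
    """Return the notes array, which explains why the RCA is low confidence."""
    return list(response.get("data", {}).get("notes", []))
```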

Complete Workflows

Workflow 1: Full Incident Investigation

Scenario: Payment processing outage on Jan 23, 2026 between 10:00 and 11:00 UTC

Step 1: Define KPIs (if not already done)

# Define impact KPI
POST /api/v1/kpi/defs
{
  "kpiDefinition": {
    "name": "payment_processing_errors",
    "layer": "impact",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "sum(rate(payment_errors_total[1m]))",
    "description": "Failed payment transactions"
  }
}

# Define cause KPIs
POST /api/v1/kpi/defs
{
  "kpiDefinition": {
    "name": "kafka_producer_latency_ms",
    "layer": "cause",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "histogram_quantile(0.99, rate(kafka_producer_latency_bucket[5m]))",
    "description": "Kafka producer p99 latency"
  }
}

POST /api/v1/kpi/defs
{
  "kpiDefinition": {
    "name": "database_connection_pool_exhausted",
    "layer": "cause",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "db_pool_active / db_pool_max > 0.95",
    "description": "Database connection pool near capacity"
  }
}

Step 2: Detect Failures

POST /api/v1/unified/failures/detect
{
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  }
}

Result: Identified failures in:

payment-service (kafka component)
database-service (connection pool)
api-gateway (timeouts)

Step 3: Run Correlation Analysis

POST /api/v1/unified/correlation
{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Result: Correlation found:

payment_processing_errors (impact)
kafka_producer_latency_ms (cause, suspicion=0.89)
database_connection_pool_exhausted (cause, suspicion=0.75)

Step 4: Run RCA

POST /api/v1/unified/rca
{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Result: RCA chain:

WHY 1: Payment processing failed (business impact)
WHY 2: payment-service degraded (errors increased)
WHY 3: kafka_producer_latency_ms spiked (direct cause)
WHY 4: kafka_broker_connection_errors occurred (root cause)

Conclusion: Kafka broker connection failures caused producer latency, leading to payment processing errors.

Workflow 2: Proactive Monitoring Setup

Scenario: Set up monitoring for a new microservice

Step 1: Bulk Import KPIs

# Create kpis.csv
name,layer,sentiment,signalType,formula,description
user_service_api_latency_p99,impact,negative,metrics,histogram_quantile(0.99\\, rate(http_request_duration_seconds_bucket{service=\"user-service\"}[5m])),API latency affecting users
user_service_error_rate,impact,negative,metrics,sum(rate(http_requests_total{service=\"user-service\"\\,status=~\"5..\"}[1m])) / sum(rate(http_requests_total{service=\"user-service\"}[1m])),Error rate impacting reliability
user_service_cpu_usage,cause,negative,metrics,avg(rate(process_cpu_seconds_total{service=\"user-service\"}[5m])) * 100,CPU utilization
user_service_memory_usage,cause,negative,metrics,process_resident_memory_bytes{service=\"user-service\"} / 1024 / 1024,Memory usage in MB
user_service_db_query_latency,cause,negative,metrics,histogram_quantile(0.95\\, rate(db_query_duration_seconds_bucket{service=\"user-service\"}[5m])),Database query latency

# Import
POST /api/v1/kpi/defs/bulk-csv < kpis.csv

Step 2: Verify KPIs

GET /api/v1/kpi/defs?tags=user-service&limit=10

Step 3: Test Correlation

# Run correlation for last hour
POST /api/v1/unified/correlation
{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Step 4: Schedule Periodic RCA

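Set up a cron job or monitoring system to run RCA on a rolling window:

# Run RCA every 15 minutes for the last 15 minutes.
# A crontab entry must be a single line, and % must be escaped as \% (cron treats a bare % as a newline).
# date -d requires GNU date.
*/15 * * * * curl -X POST http://mirador-core:8010/api/v1/unified/rca -H "Content-Type: application/json" -d "{\"startTime\":\"$(date -u -d '15 minutes ago' +\%Y-\%m-\%dT\%H:\%M:\%SZ)\",\"endTime\":\"$(date -u +\%Y-\%m-\%dT\%H:\%M:\%SZ)\"}"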
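
If GNU date is not available, or the job runs from a scheduler other than cron, the rolling window could be computed in Python instead. This is a minimal sketch; `rolling_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def rolling_window(now: datetime, minutes: int = 15) -> dict:
    """Return startTime/endTime covering the `minutes` before `now`, as RFC3339 UTC."""
    end = now.astimezone(timezone.utc).replace(microsecond=0)
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {"startTime": start.strftime(fmt), "endTime": end.strftime(fmt)}
```

Calling `rolling_window(datetime.now(timezone.utc))` produces a body that can be serialized with `json.dumps` and posted to /api/v1/unified/rca.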

Workflow 3: Historical Incident Analysis

Scenario: Analyze pattern of failures over past week

Step 1: List Failures

POST /api/v1/unified/failures/list
{
  "limit": 100,
  "offset": 0,
  "filters": {
    "start_time": "2026-01-16T00:00:00Z",
    "end_time": "2026-01-23T00:00:00Z",
    "severity": "high"
  }
}
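
A week of failures may not fit in a single page of 100 records. The sketch below advances `offset` until a short page comes back. `fetch_page(limit, offset)` is a hypothetical callable that performs the POST /api/v1/unified/failures/list call and returns the list of records; the response field that holds those records is not shown in this section, so extracting it is left to that callable.

```python
from typing import Callable, Iterator


def iter_failures(fetch_page: Callable[[int, int], list], limit: int = 100) -> Iterator[dict]:
    """Yield every failure record by advancing offset until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit
```

Stopping on a short page costs one extra request when the total is an exact multiple of `limit`, but avoids relying on a total-count field.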

Step 2: Analyze Each Failure

# For each failure_uuid from step 1
POST /api/v1/unified/failures/get
{
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Step 3: Run RCA for Each Incident

# For each incident time window
POST /api/v1/unified/rca
{
  "startTime": "<incident_start>",
  "endTime": "<incident_end>"
}

Step 4: Aggregate Patterns

Analyze RCA results to find:

Common root causes (e.g., kafka broker issues appearing in 80% of incidents)
Frequently affected services (e.g., payment-service)
Temporal patterns (e.g., failures cluster around 10 AM)
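
The aggregation could be sketched with simple counters over the collected RCA responses. The sketch assumes the last step of the highest-scoring chain is the root cause, and that `timeStart` uses the RFC3339 UTC format shown in the examples; `aggregate_patterns` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime


def aggregate_patterns(rca_results: list[dict]) -> dict:
    """Count root-cause KPIs, impacted services, and incident start hours (UTC).

    Assumption: the last step of the top-scoring chain is treated as the root cause.
    """
    root_causes, services, hours = Counter(), Counter(), Counter()
    for result in rca_results:
        data = result.get("data", {})
        impact = data.get("impact", {})
        if impact.get("impactService"):
            services[impact["impactService"]] += 1
        if impact.get("timeStart"):
            start = datetime.strptime(impact["timeStart"], "%Y-%m-%dT%H:%M:%SZ")
            hours[start.hour] += 1
        chains = data.get("chains") or []
        if chains:
            best = max(chains, key=lambda c: c.get("score", 0))
            if best.get("steps"):
                root_causes[best["steps"][-1]["kpiName"]] += 1
    return {"root_causes": root_causes, "services": services, "hours": hours}
```

`root_causes.most_common(3)` then lists the most frequent root-cause KPIs across the analyzed incidents.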

API Reference

Base URL

http://localhost:8010

Authentication

Mirador Core is designed to run behind an external API gateway or service mesh that handles authentication. It has no built-in authentication, so API calls made directly to it do not need credentials.

Content Type

All requests that send a JSON body must include:

Content-Type: application/json

KPI Endpoints

Method	Endpoint	Description
GET	`/api/v1/kpi/defs`	List KPI definitions
POST	`/api/v1/kpi/defs`	Create/update KPI definition
GET	`/api/v1/kpi/defs/{id}`	Get single KPI definition
DELETE	`/api/v1/kpi/defs/{id}`	Delete KPI definition
POST	`/api/v1/kpi/defs/bulk-json`	Bulk import KPIs (JSON)
POST	`/api/v1/kpi/defs/bulk-csv`	Bulk import KPIs (CSV)
POST	`/api/v1/kpi/search`	Semantic search for KPIs

Failure Detection Endpoints

Method	Endpoint	Description
POST	`/api/v1/unified/failures/detect`	Detect component failures
POST	`/api/v1/unified/failures/correlate`	Correlate transaction failures
POST	`/api/v1/unified/failures/list`	List stored failures
POST	`/api/v1/unified/failures/get`	Get failure details
POST	`/api/v1/unified/failures/delete`	Delete failure record

Correlation Endpoints

Method	Endpoint	Description
POST	`/api/v1/unified/correlation`	Run correlation analysis

RCA Endpoints

Method	Endpoint	Description
POST	`/api/v1/unified/rca`	Compute root cause analysis

Internal Endpoints

Method	Endpoint	Description
GET	`/health`	Health check
GET	`/ready`	Readiness check
GET	`/metrics`	Prometheus metrics
GET	`/api/v1/health`	API v1 health
GET	`/api/openapi.yaml`	OpenAPI spec (YAML)
GET	`/api/openapi.json`	OpenAPI spec (JSON)
GET	`/swagger/index.html`	Swagger UI

Query Parameters

KPI List Filtering:

limit (int): Max results (default: 10, max: 10000)
offset (int): Pagination offset (default: 0)
tags (string[]): Filter by tags (comma-separated)
layer (string): Filter by layer (impact/cause)
sentiment (string): Filter by sentiment (positive/negative/neutral)
signalType (string): Filter by signal type (metrics/logs/traces/business/synthetic)
kind (string): Filter by kind (business/tech)
classifier (string): Filter by classifier
datastore (string): Filter by datastore

Example:

GET /api/v1/kpi/defs?layer=impact&sentiment=negative&limit=50&tags=critical,payments
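
The same URL could be assembled programmatically so that filter values are encoded safely. This sketch uses the standard library's `urlencode`; `kpi_list_url` is a hypothetical helper, and commas are kept literal on the assumption that the server expects the comma-separated tag format described above.

```python
from urllib.parse import urlencode


def kpi_list_url(base_url: str, tags=None, **filters) -> str:
    """Build a KPI list URL; None-valued filters are dropped, tags are comma-joined."""
    params = {k: v for k, v in filters.items() if v is not None}
    if tags:
        params["tags"] = ",".join(tags)
    query = urlencode(params, safe=",")  # safe="," keeps tag separators unescaped
    return f"{base_url}/api/v1/kpi/defs" + (f"?{query}" if query else "")
```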

Troubleshooting

Problem: KPIs Not Being Detected by Correlation

Symptoms:

Correlation returns empty results
RCA shows “No correlation data”

Diagnosis:

# 1. Check if KPIs exist
GET /api/v1/kpi/defs?limit=10

# 2. Check if KPIs have formulas
GET /api/v1/kpi/defs/{id}
# Verify "formula" or "query" field is populated

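# 3. Test formula manually against VictoriaMetrics
# (-G with --data-urlencode encodes braces, quotes, and spaces in the formula)
curl -G "http://victoriametrics:8428/api/v1/query" --data-urlencode "query=<your_formula>" --data-urlencode "time=$(date +%s)"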

Solutions:

Ensure KPIs have valid formula or query fields
Verify VictoriaMetrics contains data for the formula
Check time window overlaps with actual data availability

Problem: Failure Detection Returns Empty Results

Symptoms:

POST /api/v1/unified/failures/detect returns "total_incidents": 0

Diagnosis:

# 1. Check if traces exist (-g stops curl from treating {} as a URL glob)
curl -g "http://victoriatraces:10428/api/v1/search?start=<start_epoch>&end=<end_epoch>&tags={}"

# 2. Check if metrics exist
curl "http://victoriametrics:8428/api/v1/query_range?query=up&start=<start>&end=<end>&step=60"

# 3. Verify time window format
# Must be RFC3339 UTC: "2026-01-23T10:00:00Z"

Solutions:

Ensure traces are being ingested to VictoriaTraces
Verify error spans have error=true tag
Check anomaly metrics have iforest_is_anomaly=true label
Confirm time window contains actual incident data

Problem: RCA Returns Low Confidence

Symptoms:

RCA score is 0 or very low
Chains are empty
Notes contain: “Correlation produced no candidates”

Diagnosis:

# 1. Run correlation first to see what it finds
POST /api/v1/unified/correlation
{
  "startTime": "...",
  "endTime": "..."
}

# 2. Check correlation response
# If empty, diagnose correlation (see above)

# 3. Verify KPIs have layer=impact and layer=cause defined
GET /api/v1/kpi/defs?layer=impact
GET /api/v1/kpi/defs?layer=cause

Solutions:

Define at least one layer=impact KPI
Define multiple layer=cause KPIs
Ensure time window contains degradation events
Run failure detection first to confirm incidents exist

Problem: Time Window Validation Errors

Symptoms:

400 Bad Request: time window too small
413 Payload Too Large: time window too large

Diagnosis:

# Check engine configuration
curl http://localhost:8010/api/v1/unified/metadata | jq '.engineConfig'

Solutions:

Adjust window to respect min_window and max_window
Default constraints:
- min_window: 1m
- max_window: 1h
Update configs/config.yaml if constraints are too restrictive
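
A client-side check can catch window errors before the request is sent. The sketch below uses the default constraints listed above and assumes both bounds are inclusive, which the guide does not state; `check_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta

MIN_WINDOW = timedelta(minutes=1)  # default min_window from this guide
MAX_WINDOW = timedelta(hours=1)    # default max_window from this guide
FMT = "%Y-%m-%dT%H:%M:%SZ"


def check_window(start: str, end: str, min_window=MIN_WINDOW, max_window=MAX_WINDOW) -> str:
    """Return 'ok', 'too_small', or 'too_large' for an RFC3339 UTC window."""
    width = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    if width < min_window:
        return "too_small"
    if width > max_window:
        return "too_large"
    return "ok"
```

If the deployment overrides the defaults, pass the values reported by the metadata endpoint as `min_window` and `max_window`.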

Problem: Weaviate Connection Failures

Symptoms:

KPI creation fails with “weaviate unavailable”
Failure detection works but failures aren’t persisted

Diagnosis:

# 1. Check Weaviate health
curl http://weaviate:8080/v1/.well-known/ready

# 2. Check Mirador Core logs
docker logs mirador-core | grep -i weaviate

# 3. Verify Weaviate is enabled in config
grep -A5 weaviate configs/config.yaml

Solutions:

Ensure Weaviate container is running: docker ps | grep weaviate
Verify network connectivity: docker network inspect mirador-net
Set weaviate.enabled: true in config
Restart Mirador Core: docker restart mirador-core

Problem: High Memory Usage

Symptoms:

Mirador Core container OOM killed
Slow response times

Diagnosis:

# Check container memory
docker stats mirador-core

# Check VictoriaMetrics data volume
docker exec victoriametrics du -sh /victoria-metrics-data

Solutions:

Increase container memory limit in docker-compose.yml:

deploy:
  resources:
    limits:
      memory: 4G

Reduce default_query_limit in config.yaml
Enable more aggressive caching (increase TTL)
Reduce max_window to limit analysis scope

Problem: Correlation Takes Too Long

Symptoms:

Correlation requests timeout
High CPU usage during correlation

Diagnosis:

# Check number of KPIs
curl -s http://localhost:8010/api/v1/kpi/defs | jq '.total'

# Check time window size
# Large windows = more data to analyze

Solutions:

Reduce time window (use 15m instead of 1h)
Reduce number of active KPIs (archive unused ones)

Increase timeout in config:

database:
  victoria_metrics:
    timeout: 60000  # milliseconds (60 seconds)

Scale Mirador Core horizontally (add more replicas)
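
Reducing the time window can be automated by splitting a long period into shorter consecutive windows and running correlation on each. The sketch below is one way to do that; `split_window` is a hypothetical helper. Note that analyzing chunks separately will not surface relationships that span a chunk boundary, and the final chunk may be shorter than min_window.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%SZ"


def split_window(start: str, end: str, chunk=timedelta(minutes=15)) -> list[dict]:
    """Split [start, end) into consecutive windows no longer than `chunk`."""
    if chunk <= timedelta(0):
        raise ValueError("chunk must be positive")
    cursor, stop = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    windows = []
    while cursor < stop:
        nxt = min(cursor + chunk, stop)
        windows.append({"startTime": cursor.strftime(FMT), "endTime": nxt.strftime(FMT)})
        cursor = nxt
    return windows
```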

Getting Help

Logs:

# Mirador Core logs
docker logs mirador-core --tail 100 -f

# All services logs
docker-compose logs -f

Health Checks:

# Overall health
curl http://localhost:8010/api/v1/health | jq

# Service status
curl http://localhost:8010/microservices/status | jq

Metrics:

# Prometheus metrics
curl http://localhost:8010/metrics | grep mirador

Support:

GitHub Issues: https://github.com/mirastacklabs-ai/mirador-core/issues
Documentation: http://localhost:8010/swagger/index.html

Summary

This guide covered the four core components of Mirador Core:

KPIs: Define and manage metrics (foundation)
Failures: Detect and track incidents
Correlation: Analyze statistical relationships
RCA: Explain root causes with 5-WHY methodology

Key Takeaways:

Always define KPIs before running failure detection or RCA
Use impact (layer=impact) and cause (layer=cause) KPIs for best results
Correlation provides statistical evidence; RCA adds causal reasoning
Time windows must respect configured min/max constraints
All components are interconnected and build upon each other

For complete API documentation, see the Swagger UI.