KPI, Failures, Correlation & RCA User Guide

Version: 10.0.0
Last Updated: January 2026
Target Audience: API Consumers & Integration Engineers


Table of Contents

  1. Overview

  2. Prerequisites & Setup

  3. Configuration & Deployment

  4. Component 1: KPI Management

  5. Component 2: Failure Detection

  6. Component 3: Correlation Analysis

  7. Component 4: Root Cause Analysis (RCA)

  8. Complete Workflows

  9. API Reference

  10. Troubleshooting


Overview

What This Guide Covers

This guide explains how to use Mirador Core’s four interconnected observability components:

┌──────────┐     ┌──────────┐     ┌─────────────┐     ┌──────┐
│   KPIs   │ --> │ Failures │ --> │ Correlation │ --> │ RCA  │
└──────────┘     └──────────┘     └─────────────┘     └──────┘
   Define           Detect          Analyze           Explain
   Metrics          Incidents       Patterns          Root Cause

Dependency Chain

Each component builds upon the previous:

  1. KPIs (Key Performance Indicators): Define what metrics to monitor

  2. Failures: Detect incidents based on KPI anomalies and error signals

  3. Correlation: Perform statistical analysis to find relationships between KPIs

  4. RCA (Root Cause Analysis): Use correlation data + 5 WHY methodology to identify root causes

Important: Without KPIs defined, Failure Detection will have limited effectiveness. Without Failures and KPIs, RCA cannot function.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Mirador Core API                          │
│                  (localhost:8010)                           │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
    ┌───▼────┐      ┌──────▼───────┐     ┌────▼─────┐
    │  KPI   │      │  Correlation │     │   RCA    │
    │  Repo  │      │   Engine     │     │  Engine  │
    └───┬────┘      └──────┬───────┘     └────┬─────┘
        │                  │                   │
        │          ┌───────▼────────┐          │
        │          │ Failure        │          │
        └──────────► Detection      ◄──────────┘
                   └────────┬───────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐      ┌──────▼───────┐    ┌─────▼─────┐
   │Victoria │      │ Victoria     │    │ Victoria  │
   │ Metrics │      │ Logs         │    │ Traces    │
   └─────────┘      └──────────────┘    └───────────┘

Prerequisites & Setup

System Requirements

Mirador Core:

  • Go 1.21+ (for building from source)

  • Docker & Docker Compose (for containerized deployment)

  • 4GB RAM minimum (8GB recommended)

  • 20GB disk space

Required Backend Services:

  • VictoriaMetrics (metrics storage)

  • VictoriaLogs (logs storage)

  • VictoriaTraces (traces storage)

  • Valkey/Redis (caching)

  • Weaviate (optional, for KPI vector storage)

Quick Start

Using Docker Compose (Recommended):

# 1. Clone repository
git clone https://github.com/mirastacklabs-ai/mirador-core
cd mirador-core

# 2. Start all services
make localdev-up

# 3. Wait for services to be ready (monitors health checks)
make localdev-wait

# 4. Verify Mirador Core is running
curl http://localhost:8010/api/v1/health

# 5. Seed sample KPIs (optional)
make localdev-seed-data

Expected Output:

{
  "status": "healthy",
  "timestamp": "2026-01-23T10:00:00Z",
  "services": {
    "mirador-core": "ok",
    "victoriametrics": "ok",
    "victorialogs": "ok",
    "victoriatraces": "ok",
    "valkey": "ok"
  }
}

Access Points

Once running, you can access:

  • Mirador Core API: http://localhost:8010

  • Swagger UI: http://localhost:8010/swagger/index.html

  • OpenAPI Spec: http://localhost:8010/api/openapi.yaml

  • Health Check: http://localhost:8010/health

  • Prometheus Metrics: http://localhost:8010/metrics


Configuration & Deployment

Configuration File Structure

Mirador Core uses a YAML configuration file located at configs/config.yaml:

# Basic Settings
environment: production  # or development
port: 8010
log_level: info

# VictoriaMetrics Ecosystem
database:
  victoria_metrics:
    endpoints:
      - "http://victoriametrics:8428"
    timeout: 30000
    cluster_mode: false
    
  victoria_logs:
    endpoints:
      - "http://victorialogs:9428"
    timeout: 30000
    
  victoria_traces:
    endpoints:
      - "http://victoriatraces:10428"
    timeout: 30000

# Caching (Valkey/Redis)
cache:
  nodes:
    - "valkey:6379"
  ttl: 300  # 5 minutes default
  password: ""  # Set via env: CACHE_PASSWORD
  db: 0

# CORS (for frontend integration)
cors:
  allowed_origins:
    - "https://your-mirador-ui.com"
  allowed_methods:
    - "GET"
    - "POST"
    - "PUT"
    - "DELETE"

# Weaviate (optional - for KPI vector storage)
weaviate:
  enabled: true
  scheme: "http"
  host: "weaviate"
  port: 8080
  vectorizer:
    provider: "text2vec-transformers"
    model: "sentence-transformers/all-MiniLM-L6-v2"
    use_gpu: false

# Engine Configuration
engine:
  # Time window constraints
  min_window: 1m
  max_window: 1h
  
  # Payload validation
  strict_time_window_payload: true
  strict_time_window: true
  
  # Correlation settings
  correlation_threshold: 0.7
  default_graph_hops: 3
  default_max_whys: 5
  
  # Ring strategy for RCA
  ring_strategy: "default"
  
  # Query limits
  default_query_limit: 1000

Environment Variables

Override configuration via environment variables:

# Database credentials
export VM_PASSWORD="your-victoriametrics-password"
export VL_PASSWORD="your-victorialogs-password"

# Cache credentials
export CACHE_PASSWORD="your-valkey-password"

# Application settings
export PORT=8010
export LOG_LEVEL=info
export ENVIRONMENT=production

# Weaviate connection
export WEAVIATE_HOST=weaviate
export WEAVIATE_PORT=8080

Docker Deployment

Production docker-compose.yml:

version: '3.8'

services:
  mirador-core:
    image: miradorstack/mirador-core:latest
    container_name: mirador-core
    ports:
      - "8010:8010"
    environment:
      - ENVIRONMENT=production
      - LOG_LEVEL=info
      - VM_ENDPOINT=http://victoriametrics:8428
      - VL_ENDPOINT=http://victorialogs:9428
      - VT_ENDPOINT=http://victoriatraces:10428
      - CACHE_NODES=valkey:6379
      - CACHE_PASSWORD=${CACHE_PASSWORD}
      - WEAVIATE_HOST=weaviate
      - WEAVIATE_PORT=8080
    volumes:
      - ./configs:/app/configs:ro
    depends_on:
      - victoriametrics
      - victorialogs
      - victoriatraces
      - valkey
      - weaviate
    restart: unless-stopped
    networks:
      - mirador-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8010/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    container_name: victoriametrics
    ports:
      - "8428:8428"
    volumes:
      - vmdata:/victoria-metrics-data
    command:
      - --storageDataPath=/victoria-metrics-data
      - --search.maxUniqueTimeseries=2000000
      - --memory.allowedPercent=90
    networks:
      - mirador-net
    restart: unless-stopped

  victorialogs:
    image: victoriametrics/victoria-logs:latest
    container_name: victorialogs
    ports:
      - "9428:9428"
    volumes:
      - vldata:/victoria-logs-data
    command:
      - --storageDataPath=/victoria-logs-data
    networks:
      - mirador-net
    restart: unless-stopped

  victoriatraces:
    image: victoriametrics/victoria-traces:latest
    container_name: victoriatraces
    ports:
      - "10428:10428"
    volumes:
      - vtdata:/victoria-traces-data
    command:
      - --storageDataPath=/victoria-traces-data
    networks:
      - mirador-net
    restart: unless-stopped

  valkey:
    image: valkey/valkey:latest
    container_name: valkey
    ports:
      - "6379:6379"
    command: >
      valkey-server
      --requirepass ${CACHE_PASSWORD}
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
    volumes:
      - valkeydata:/data
    networks:
      - mirador-net
    restart: unless-stopped

  weaviate:
    image: semitechnologies/weaviate:latest
    container_name: weaviate
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
    volumes:
      - weaviatedata:/var/lib/weaviate
    networks:
      - mirador-net
    restart: unless-stopped

  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    container_name: t2v-transformers
    environment:
      ENABLE_CUDA: '0'
    networks:
      - mirador-net
    restart: unless-stopped

networks:
  mirador-net:
    driver: bridge

volumes:
  vmdata:
  vldata:
  vtdata:
  valkeydata:
  weaviatedata:

Start production environment:

# Export required environment variables
export CACHE_PASSWORD="your-secure-password"

# Start services
docker-compose up -d

# Check logs
docker-compose logs -f mirador-core

# Verify health
curl http://localhost:8010/api/v1/health

Kubernetes Deployment

For Kubernetes deployments, see deployments/k8s/ directory which includes:

  • Deployment manifests

  • Service definitions

  • ConfigMaps and Secrets

  • Ingress configuration

  • Horizontal Pod Autoscaler (HPA)

Quick deploy:

# Apply all K8s resources
kubectl apply -f deployments/k8s/

# Check deployment status
kubectl get pods -n mirador

# Port-forward for local access
kubectl port-forward -n mirador svc/mirador-core 8010:8010

Component 1: KPI Management

What are KPIs?

Key Performance Indicators (KPIs) are the foundation of observability in Mirador Core. They define:

  • What metrics to monitor (e.g., API error rates, latency, transaction volume)

  • Where to find the data (metrics, logs, traces)

  • How to query it (formulas, query objects)

  • Business context (impact layer vs cause layer, sentiment, domain)

KPIs enable:

  • Registry-driven monitoring: Central source of truth for all monitored signals

  • Correlation discovery: Automatic relationship detection between metrics

  • RCA accuracy: Better root cause chains with pre-defined impact/cause layers

  • Natural language search: Vector-based semantic search over KPI descriptions

KPI Structure

A KPI definition contains:

{
  "id": "kpi-uuid-123",
  "name": "api_errors_total",
  "kind": "tech",
  "layer": "impact",
  "signalType": "metrics",
  "sentiment": "negative",
  "classifier": "errors",
  "datastore": "victoriametrics",
  "queryType": "MetricsQL",
  "unit": "count",
  "format": "integer",
  "formula": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
  "definition": "Total API errors at the gateway per minute",
  "businessImpact": "Revenue loss due to failed customer transactions",
  "description": "Tracks API gateway errors - critical service health indicator",
  "tags": ["api", "errors", "critical"],
  "dataType": "timeseries",
  "aggregationWindowHint": "1m",
  "dimensionsHint": ["service.name", "region"],
  "refreshInterval": 60,
  "isShared": true,
  "userId": "user-uuid-456"
}

Required Fields:

  • name: Unique identifier

  • layer: impact (business/user-facing) or cause (infrastructure/technical)

  • sentiment: positive, negative, or neutral

  • signalType: metrics, logs, traces, business, synthetic

Optional But Recommended:

  • formula or query: How to fetch the data

  • businessImpact: Why this metric matters

  • tags: For filtering and organization

Creating KPIs

Single KPI Creation:

POST /api/v1/kpi/defs
Content-Type: application/json

{
  "kpiDefinition": {
    "name": "payment_processing_errors",
    "kind": "business",
    "layer": "impact",
    "signalType": "metrics",
    "sentiment": "negative",
    "classifier": "errors",
    "datastore": "victoriametrics",
    "queryType": "MetricsQL",
    "unit": "count",
    "format": "integer",
    "formula": "sum(rate(payment_errors_total[1m]))",
    "definition": "Failed payment transactions per minute",
    "businessImpact": "Direct revenue loss - each error = failed customer transaction",
    "description": "Critical business KPI tracking payment processing health",
    "tags": ["payments", "critical", "revenue"],
    "dataType": "timeseries",
    "aggregationWindowHint": "1m",
    "dimensionsHint": ["payment_method", "region"],
    "serviceFamily": "payments",
    "domain": "transactions",
    "refreshInterval": 60,
    "isShared": true
  }
}

Response (201 Created):

{
  "status": "created",
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

Using Query Object (Alternative to Formula):

{
  "kpiDefinition": {
    "name": "database_latency_p99",
    "kind": "tech",
    "layer": "cause",
    "signalType": "metrics",
    "sentiment": "negative",
    "queryType": "MetricsQL",
    "query": {
      "metric": "db_query_duration_seconds",
      "aggregation": "quantile",
      "quantile": 0.99,
      "window": "5m"
    },
    "unit": "seconds",
    "format": "float",
    "definition": "99th percentile database query latency",
    "description": "Tracks database performance degradation",
    "tags": ["database", "latency", "performance"]
  }
}

Bulk KPI Import

JSON Bulk Import:

POST /api/v1/kpi/defs/bulk-json
Content-Type: application/json

{
  "kpiDefinitions": [
    {
      "name": "api_latency_p95",
      "layer": "impact",
      "sentiment": "negative",
      "signalType": "metrics",
      "formula": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
      "description": "API 95th percentile latency"
    },
    {
      "name": "kafka_consumer_lag",
      "layer": "cause",
      "sentiment": "negative",
      "signalType": "metrics",
      "formula": "sum(kafka_consumer_lag_max) by (consumer_group)",
      "description": "Kafka consumer lag indicating processing delays"
    }
  ]
}

CSV Bulk Import:

POST /api/v1/kpi/defs/bulk-csv
Content-Type: text/csv

name,layer,sentiment,signalType,formula,description
cpu_usage_percent,cause,negative,metrics,avg(rate(node_cpu_seconds_total[5m])) * 100,CPU utilization percentage
memory_usage_percent,cause,negative,metrics,100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes),Memory utilization percentage
disk_io_wait,cause,negative,metrics,rate(node_disk_io_time_seconds_total[5m]),Disk I/O wait time

Listing and Filtering KPIs

Get all KPIs (paginated):

GET /api/v1/kpi/defs?limit=10&offset=0

Filter by layer:

GET /api/v1/kpi/defs?layer=impact&limit=50

Filter by tags:

GET /api/v1/kpi/defs?tags=critical,payments

Filter by multiple criteria:

GET /api/v1/kpi/defs?layer=cause&sentiment=negative&signalType=metrics&classifier=latency

Response:

{
  "kpiDefinitions": [
    {
      "id": "kpi-uuid-1",
      "name": "api_errors_total",
      "layer": "impact",
      "sentiment": "negative",
      "description": "Total API gateway errors"
    }
  ],
  "total": 150,
  "nextOffset": 10
}

Searching KPIs (Natural Language)

Mirador Core supports vector-based semantic search over KPI descriptions:

POST /api/v1/kpi/search
Content-Type: application/json

{
  "query": "payment transaction failures affecting revenue",
  "limit": 5
}

Response:

{
  "results": [
    {
      "id": "kpi-uuid-123",
      "name": "payment_processing_errors",
      "description": "Failed payment transactions per minute",
      "score": 0.92
    },
    {
      "id": "kpi-uuid-456",
      "name": "transaction_timeout_total",
      "description": "Payment transactions timing out",
      "score": 0.85
    }
  ]
}

Retrieving a Single KPI

GET /api/v1/kpi/defs/{id}

Example:

GET /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479

Response:

{
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "name": "payment_processing_errors",
  "kind": "business",
  "layer": "impact",
  "signalType": "metrics",
  "sentiment": "negative",
  "formula": "sum(rate(payment_errors_total[1m]))",
  "definition": "Failed payment transactions per minute",
  "businessImpact": "Direct revenue loss",
  "description": "Critical business KPI tracking payment processing health",
  "tags": ["payments", "critical", "revenue"],
  "createdAt": "2026-01-20T10:00:00Z",
  "updatedAt": "2026-01-20T10:00:00Z"
}

Updating a KPI

Updates use the same endpoint as creation (upsert behavior):

POST /api/v1/kpi/defs
Content-Type: application/json

{
  "kpiDefinition": {
    "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "name": "payment_processing_errors",
    "formula": "sum(rate(payment_errors_total[5m]))",  # Changed from 1m to 5m
    "description": "Updated: 5-minute aggregation window"
  }
}

Response (200 OK):

{
  "status": "ok",
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

Deleting a KPI

DELETE /api/v1/kpi/defs/{id}

Example:

DELETE /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479

Response (200 OK):

{
  "status": "deleted",
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

Component 2: Failure Detection

What is Failure Detection?

Failure Detection analyzes telemetry data (metrics, logs, traces) within a time window to identify component failures. It:

  • Detects error spans: Traces with error=true tag

  • Identifies anomalies: Metrics with iforest_is_anomaly=true (from AI anomaly detection)

  • Groups by service+component: Aggregates failures per component

  • Generates unique IDs: Deterministic UUIDs for deduplication

  • Persists to storage: Saves failures to Weaviate for historical analysis

When to Use Failure Detection

  • Incident investigation: “What failed between 10:00 and 11:00?”

  • Post-mortem analysis: Identify all affected components during an outage

  • Pre-RCA preparation: Gather failure candidates before running RCA

  • Monitoring dashboards: Track failure trends over time

Detecting Failures

Basic Detection (All Components):

POST /api/v1/unified/failures/detect
Content-Type: application/json

{
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  }
}

Filtered Detection (Specific Components):

POST /api/v1/unified/failures/detect
Content-Type: application/json

{
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  },
  "components": ["kafka", "cassandra", "api-gateway"],
  "services": ["payment-service", "auth-service"]
}

Response:

{
  "incidents": [
    {
      "incident_id": "incident_kafka_1737626400",
      "failure_id": "kafka-producer-kafka-20260123-100000",
      "failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
      "time_range": {
        "start": "2026-01-23T10:00:00Z",
        "end": "2026-01-23T10:15:00Z"
      },
      "primary_component": "kafka",
      "affected_transaction_ids": ["txn-123", "txn-456"],
      "services_involved": ["payment-service", "order-service"],
      "failure_mode": "error_spans",
      "confidence": 0.92,
      "severity": "high"
    }
  ],
  "summary": {
    "total_incidents": 3,
    "time_range": {
      "start": "2026-01-23T10:00:00Z",
      "end": "2026-01-23T11:00:00Z"
    },
    "service_component_summaries": [
      {
        "service": "payment-service",
        "component": "kafka",
        "failure_id": "kafka-producer-kafka-20260123-100000",
        "failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
        "failure_count": 42,
        "affected_transactions": 15,
        "average_anomaly_score": 0.87,
        "average_confidence": 0.92,
        "error_spans_count": 38,
        "error_metrics_count": 4,
        "last_failure_timestamp": "2026-01-23T10:14:32Z"
      }
    ],
    "metrics_error_summary": {
      "total_error_metrics": 12,
      "total_anomaly_metrics": 8,
      "error_metrics_by_name": [
        {
          "metric_name": "kafka_producer_errors_total",
          "count": 8,
          "average_value": 3.5,
          "last_timestamp": "2026-01-23T10:14:32Z"
        }
      ],
      "anomaly_metrics_by_name": [
        {
          "metric_name": "kafka_producer_latency_ms",
          "count": 5,
          "average_value": 1250.3,
          "last_timestamp": "2026-01-23T10:14:28Z"
        }
      ]
    }
  }
}

Understanding Failure IDs

Each failure gets two identifiers:

  1. failure_id (human-readable): {service}-{component}-{YYYYMMDD-HHMMSS}

    • Example: kafka-producer-kafka-20260123-100000

    • Easy to read in logs and dashboards

  2. failure_uuid (deterministic UUID v5): e458d90f-f525-58a9-9e92-9f91faa73cf2

    • Unique identifier for storage and deduplication

    • Generated from: service + component + timestamp

    • Same failure detected multiple times = same UUID

Transaction Failure Correlation

Correlate failures for specific transaction IDs to track cascading failures:

POST /api/v1/unified/failures/correlate
Content-Type: application/json

{
  "transactionIDs": ["txn-12345", "txn-67890"],
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  }
}

Response:

{
  "incidents": [
    {
      "incident_id": "incident_txn_12345",
      "failure_id": "transaction-correlation-20260123-100000",
      "affected_transaction_ids": ["txn-12345"],
      "services_involved": ["api-gateway", "payment-service", "kafka"],
      "failure_sequence": [
        {
          "service": "api-gateway",
          "component": "http-server",
          "timestamp": "2026-01-23T10:00:05Z",
          "error": "timeout waiting for payment-service"
        },
        {
          "service": "payment-service",
          "component": "kafka-producer",
          "timestamp": "2026-01-23T10:00:04Z",
          "error": "kafka broker unavailable"
        }
      ],
      "root_component": "kafka",
      "confidence": 0.95
    }
  ]
}

Listing Stored Failures

Retrieve paginated list of historical failures:

POST /api/v1/unified/failures/list
Content-Type: application/json

{
  "limit": 20,
  "offset": 0,
  "filters": {
    "severity": "high",
    "component": "kafka"
  }
}

Response:

{
  "failures": [
    {
      "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
      "failure_id": "kafka-producer-kafka-20260123-100000",
      "summary": "Kafka producer failures in payment-service",
      "severity": "high",
      "timestamp": "2026-01-23T10:00:00Z",
      "component": "kafka",
      "service": "payment-service"
    }
  ],
  "total": 150,
  "nextOffset": 20
}

Getting Failure Details

Retrieve full failure record with all signals and metadata:

POST /api/v1/unified/failures/get
Content-Type: application/json

{
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Response:

{
  "failure": {
    "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
    "failure_id": "kafka-producer-kafka-20260123-100000",
    "time_range": {
      "start": "2026-01-23T10:00:00Z",
      "end": "2026-01-23T10:15:00Z"
    },
    "primary_component": "kafka",
    "services": "payment-service",
    "affected_transaction_count": 15,
    "error_signals": [
      {
        "timestamp": "2026-01-23T10:00:05Z",
        "service": "payment-service",
        "component": "kafka-producer",
        "error_message": "broker unavailable",
        "trace_id": "abc123",
        "span_id": "def456"
      }
    ],
    "anomaly_signals": [
      {
        "timestamp": "2026-01-23T10:00:03Z",
        "metric_name": "kafka_producer_latency_ms",
        "value": 1250.3,
        "is_anomaly": true,
        "anomaly_score": 0.92
      }
    ],
    "metadata": {
      "detection_timestamp": "2026-01-23T10:16:00Z",
      "detection_confidence": 0.92,
      "total_signals": 42
    }
  }
}

Deleting a Failure

Remove a failure record from storage:

POST /api/v1/unified/failures/delete
Content-Type: application/json

{
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Response:

{
  "status": "deleted",
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Component 3: Correlation Analysis

What is Correlation Analysis?

Correlation Analysis performs statistical analysis between KPIs to discover relationships and patterns. It:

  • Builds temporal rings: Divides time window into rings (R1: immediate, R2: short, R3: medium, R4: long)

  • Discovers impact KPIs: Identifies metrics showing degradation (red anchors)

  • Finds candidate causes: Detects correlated metrics that may explain the impact

  • Computes statistics: Pearson, Spearman, cross-correlation, partial correlation

  • Scores candidates: Assigns suspicion scores based on statistical strength

Correlation vs RCA

Aspect

Correlation

RCA

Purpose

Find statistical relationships

Explain root cause

Output

List of correlated KPIs with scores

5-WHY chains with narrative

Method

Statistical analysis

Correlation + reasoning

Use Case

“What else changed?”

“Why did it fail?”

Key Insight: RCA uses Correlation results internally, then adds causal reasoning.

Running Correlation Analysis

Time-Window Correlation (Recommended):

POST /api/v1/unified/correlation
Content-Type: application/json

{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Response:

{
  "status": "success",
  "result": {
    "correlationID": "corr_1737626400",
    "timeRange": {
      "start": "2026-01-23T10:00:00Z",
      "end": "2026-01-23T11:00:00Z"
    },
    "rings": {
      "R1_IMMEDIATE": {
        "label": "R1_IMMEDIATE",
        "description": "Anomalies very close to the peak",
        "duration": "5s",
        "start": "2026-01-23T10:55:55Z",
        "end": "2026-01-23T11:00:00Z"
      },
      "R2_SHORT": {
        "label": "R2_SHORT",
        "description": "Anomalies shortly before peak",
        "duration": "30s",
        "start": "2026-01-23T10:55:25Z",
        "end": "2026-01-23T10:55:55Z"
      },
      "R3_MEDIUM": {
        "label": "R3_MEDIUM",
        "description": "Anomalies moderately before peak",
        "duration": "2m",
        "start": "2026-01-23T10:53:25Z",
        "end": "2026-01-23T10:55:25Z"
      },
      "R4_LONG": {
        "label": "R4_LONG",
        "description": "Anomalies further back",
        "duration": "10m",
        "start": "2026-01-23T10:43:25Z",
        "end": "2026-01-23T10:53:25Z"
      }
    },
    "affectedServices": ["payment-service", "kafka"],
    "redAnchors": [
      {
        "service": "payment-service",
        "metric": "payment_processing_errors",
        "score": 0.95,
        "ring": "R1_IMMEDIATE",
        "labelFingerprint": {
          "service.name": "payment-service",
          "region": "us-east-1"
        }
      }
    ],
    "causes": [
      {
        "kpi": "kafka_producer_latency_ms",
        "service": "payment-service",
        "suspicionScore": 0.89,
        "ring": "R2_SHORT",
        "reasons": [
          "high_pearson_correlation",
          "high_spearman_correlation",
          "temporal_precedence",
          "high_anomaly_density"
        ],
        "stats": {
          "pearson": 0.87,
          "spearman": 0.91,
          "crossCorrMax": 0.88,
          "crossCorrLag": -2,
          "partial": 0.82
        },
        "labelFingerprint": {
          "service.name": "payment-service",
          "component": "kafka-producer"
        }
      },
      {
        "kpi": "kafka_broker_connection_errors",
        "service": "kafka",
        "suspicionScore": 0.92,
        "ring": "R3_MEDIUM",
        "reasons": [
          "high_pearson_correlation",
          "temporal_precedence",
          "upstream_component"
        ],
        "stats": {
          "pearson": 0.93,
          "spearman": 0.89,
          "crossCorrMax": 0.91,
          "crossCorrLag": -5,
          "partial": 0.85
        }
      }
    ],
    "confidence": 0.91,
    "createdAt": "2026-01-23T11:01:00Z"
  }
}

Understanding Correlation Results

Red Anchors:

  • Metrics showing impact (business/user-facing degradation)

  • Typically from KPIs with layer=impact

  • High scores = strong impact signal

Cause Candidates:

  • Metrics that correlate with red anchors

  • Typically from KPIs with layer=cause

  • Ranked by suspicion score (higher = more suspicious)

Statistics Explained:

Metric

Meaning

Range

Pearson

Linear correlation strength

-1 to +1

Spearman

Rank correlation (monotonic relationship)

-1 to +1

CrossCorrMax

Maximum cross-correlation

-1 to +1

CrossCorrLag

Time lag (negative = cause precedes impact)

seconds

Partial

Correlation after removing confounders

-1 to +1

Suspicion Score Calculation:

suspicionScore = weighted_average(
  pearson_correlation,
  spearman_correlation,
  cross_correlation_max,
  partial_correlation,
  anomaly_density,
  temporal_precedence
)

Reasons (Why This Candidate is Suspicious):

  • high_pearson_correlation: Strong linear relationship

  • high_spearman_correlation: Strong monotonic relationship

  • temporal_precedence: Cause occurred before impact

  • high_anomaly_density: Many anomalies in this metric

  • upstream_component: Component is upstream in service graph

Time Window Constraints

Correlation enforces time window limits from configuration:

engine:
  min_window: 1m   # Minimum analysis window
  max_window: 1h   # Maximum analysis window

Invalid Windows:

# Too small
{"startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T10:00:30Z"}
# Error: time window too small: 30s < minWindow 1m

# Too large
{"startTime": "2026-01-23T00:00:00Z", "endTime": "2026-01-23T23:59:59Z"}
# Error: time window too large: 23h59m59s > maxWindow 1h

Component 4: Root Cause Analysis (RCA)

What is RCA?

Root Cause Analysis (RCA) combines:

  1. Correlation results (statistical relationships)

  2. 5 WHY methodology (iterative questioning)

  3. Service topology (upstream/downstream relationships)

  4. Temporal rings (time-based evidence)

To produce human-readable explanations of why incidents occurred.

RCA Process Flow

1. User provides time window
        ↓
2. RCA engine calls Correlation engine
        ↓
3. Correlation returns:
   - Red anchors (impacts)
   - Cause candidates (correlated metrics)
   - Statistical evidence
        ↓
4. RCA builds 5-WHY chains:
   - WHY 1: Business impact (what failed?)
   - WHY 2: Entry service degradation
   - WHY 3-5: Upstream causes (evidence-driven)
        ↓
5. Returns structured narrative with:
   - Impact summary
   - Causal chains
   - Time rings
   - Diagnostic details

Running RCA

Basic RCA Request:

POST /api/v1/unified/rca
Content-Type: application/json

{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Response:

{
  "status": "success",
  "data": {
    "impact": {
      "id": "incident_payment_service_1737626400",
      "impactService": "payment-service",
      "metricName": "payment_processing_errors",
      "timeStart": "2026-01-23T10:00:00Z",
      "timeEnd": "2026-01-23T11:00:00Z",
      "impactSummary": "Impact detected on payment-service (correlation confidence 0.91). Top-candidate kafka_producer_latency_ms: pearson=0.87 spearman=0.91 partial=0.82 cross_max=0.88 lag=-2 anomalies=HIGH",
      "severity": 0.91
    },
    "chains": [
      {
        "steps": [
          {
            "why": 1,
            "service": "payment-service",
            "kpiName": "payment_processing_errors",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R1_IMMEDIATE",
            "direction": "SAME",
            "score": 0.91,
            "evidence": [
              {
                "type": "red_anchor",
                "key": "payment-service",
                "value": "anchor_score=0.950"
              }
            ],
            "summary": "Payment processing failed: payment_processing_errors increased dramatically in R1_IMMEDIATE (0.91 confidence)"
          },
          {
            "why": 2,
            "service": "payment-service",
            "kpiName": "payment_processing_errors",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R1_IMMEDIATE",
            "direction": "UPSTREAM",
            "score": 0.95,
            "evidence": [
              {
                "type": "red_anchor",
                "key": "payment-service",
                "value": "metric=payment_processing_errors score=0.950"
              }
            ],
            "summary": "payment-service degraded: payment_processing_errors showed anomalies in R1_IMMEDIATE (0.95 confidence)"
          },
          {
            "why": 3,
            "service": "payment-service",
            "kpiName": "kafka_producer_latency_ms",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R2_SHORT",
            "direction": "UPSTREAM",
            "score": 0.89,
            "evidence": [
              {
                "type": "correlation_stats",
                "key": "kafka_producer_latency_ms",
                "value": "pearson=0.87 spearman=0.91 cross_lag=-2 suspicion=0.89"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_producer_latency_ms",
                "value": "high_pearson_correlation"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_producer_latency_ms",
                "value": "temporal_precedence"
              }
            ],
            "summary": "Upstream component kafka_producer_latency_ms caused payment-service degradation (0.89 suspicion, pearson=0.87)"
          },
          {
            "why": 4,
            "service": "kafka",
            "kpiName": "kafka_broker_connection_errors",
            "timeRange": {
              "start": "2026-01-23T10:00:00Z",
              "end": "2026-01-23T11:00:00Z"
            },
            "ring": "R2_SHORT",
            "direction": "UPSTREAM",
            "score": 0.92,
            "evidence": [
              {
                "type": "correlation_stats",
                "key": "kafka_broker_connection_errors",
                "value": "pearson=0.93 spearman=0.89 cross_lag=-5 suspicion=0.92"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_broker_connection_errors",
                "value": "high_pearson_correlation"
              },
              {
                "type": "correlation_reason",
                "key": "kafka_broker_connection_errors",
                "value": "upstream_component"
              }
            ],
            "summary": "Upstream component kafka_broker_connection_errors caused payment-service degradation (0.92 suspicion, pearson=0.93)"
          }
        ],
        "score": 0.91,
        "confidence": 0.91
      }
    ],
    "generatedAt": "2026-01-23T11:01:30Z",
    "score": 0.91,
    "notes": [],
    "diagnostics": {},
    "timeRings": {
      "definitions": {
        "R1_IMMEDIATE": {
          "label": "R1_IMMEDIATE",
          "description": "Anomalies very close to the peak",
          "duration": "5s"
        },
        "R2_SHORT": {
          "label": "R2_SHORT",
          "description": "Anomalies shortly before peak",
          "duration": "30s"
        },
        "R3_MEDIUM": {
          "label": "R3_MEDIUM",
          "description": "Anomalies moderately before peak",
          "duration": "2m"
        },
        "R4_LONG": {
          "label": "R4_LONG",
          "description": "Anomalies further back",
          "duration": "10m"
        }
      },
      "perChain": []
    }
  },
  "timestamp": "2026-01-23T11:01:30Z"
}

Understanding RCA Output

Impact Section:

  • impactService: Which service was affected

  • metricName: What metric degraded

  • impactSummary: Human-readable summary with statistical evidence

  • severity: Confidence score (0-1)

Chains (5-WHY Chains):

Each chain represents one possible root cause path:

Step Structure:

  • why: Step number (1-5)

  • service: Service involved at this step

  • kpiName: Metric/KPI at this step

  • ring: Temporal ring (when it occurred)

  • direction: SAME (impact), UPSTREAM (cause), DOWNSTREAM (effect)

  • score: Confidence for this step

  • evidence: Statistical and correlation evidence

  • summary: Human-readable explanation

Chain Scoring:

  • Weighted average: earlier steps (WHY 1-2) weighted higher

  • Higher score = more confident root cause path

  • Multiple chains = multiple possible root causes (sorted by score)

Time Rings:

  • R1_IMMEDIATE (5s): Events very close to peak

  • R2_SHORT (30s): Events shortly before peak

  • R3_MEDIUM (2m): Events moderately before peak

  • R4_LONG (10m): Events further back

Rings help identify temporal ordering (cause precedes effect).

Interpreting a 5-WHY Chain

Example chain interpretation:

WHY 1 (Business Impact): "Payment processing failed"
  → What the user experienced
  → Business/revenue impact

WHY 2 (Entry Service): "payment-service degraded"
  → Which service exhibited the problem
  → Where the impact manifested

WHY 3 (Direct Cause): "kafka_producer_latency_ms increased"
  → Immediate technical cause
  → Component directly affecting service

WHY 4 (Upstream Cause): "kafka_broker_connection_errors occurred"
  → Root infrastructure issue
  → What actually triggered the cascade

WHY 5 (Optional): Further upstream causes if available

Low Confidence RCA

If no correlation data is available:

Request:

{
  "startTime": "2026-01-01T00:00:00Z",
  "endTime": "2026-01-01T00:05:00Z"
}

Response:

{
  "status": "success",
  "data": {
    "impact": {
      "id": "incident_unknown",
      "impactService": "unknown",
      "metricName": "unknown",
      "impactSummary": "No correlation data for window 2026-01-01 00:00:00 +0000 UTC - 2026-01-01 00:05:00 +0000 UTC",
      "severity": 0
    },
    "chains": [],
    "score": 0,
    "notes": ["Correlation produced no candidates; returning low-confidence RCA"]
  }
}

This indicates:

  • No KPIs were found with degradation in this window

  • Or KPI registry is empty

  • Or VictoriaMetrics/Logs/Traces have no data for this period

Resolution:

  1. Verify KPIs are defined

  2. Check time window contains actual incidents

  3. Ensure telemetry data exists in VictoriaMetrics/Logs/Traces


Complete Workflows

Workflow 1: Full Incident Investigation

Scenario: Payment processing outage on Jan 23, 2026 between 10:00-11:00

Step 1: Define KPIs (if not already done)

# Define impact KPI
POST /api/v1/kpi/defs
{
  "kpiDefinition": {
    "name": "payment_processing_errors",
    "layer": "impact",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "sum(rate(payment_errors_total[1m]))",
    "description": "Failed payment transactions"
  }
}

# Define cause KPIs
POST /api/v1/kpi/defs
{
  "kpiDefinition": {
    "name": "kafka_producer_latency_ms",
    "layer": "cause",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "histogram_quantile(0.99, kafka_producer_latency_bucket)",
    "description": "Kafka producer p99 latency"
  }
}

POST /api/v1/kpi/defs
{
  "kpiDefinition": {
    "name": "database_connection_pool_exhausted",
    "layer": "cause",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "db_pool_active / db_pool_max > 0.95",
    "description": "Database connection pool near capacity"
  }
}

Step 2: Detect Failures

POST /api/v1/unified/failures/detect
{
  "time_range": {
    "start": "2026-01-23T10:00:00Z",
    "end": "2026-01-23T11:00:00Z"
  }
}

Result: Identified failures in:

  • payment-service (kafka component)

  • database-service (connection pool)

  • api-gateway (timeouts)

Step 3: Run Correlation Analysis

POST /api/v1/unified/correlation
{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Result: Correlation found:

  • payment_processing_errors (impact)

  • kafka_producer_latency_ms (cause, suspicion=0.89)

  • database_connection_pool_exhausted (cause, suspicion=0.75)

Step 4: Run RCA

POST /api/v1/unified/rca
{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Result: RCA chain:

  1. WHY 1: Payment processing failed (business impact)

  2. WHY 2: payment-service degraded (errors increased)

  3. WHY 3: kafka_producer_latency_ms spiked (direct cause)

  4. WHY 4: kafka_broker_connection_errors occurred (root cause)

Conclusion: Kafka broker connection failures caused producer latency, leading to payment processing errors.

Workflow 2: Proactive Monitoring Setup

Scenario: Set up monitoring for a new microservice

Step 1: Bulk Import KPIs

# Create kpis.csv
name,layer,sentiment,signalType,formula,description
user_service_api_latency_p99,impact,negative,metrics,histogram_quantile(0.99\\, rate(http_request_duration_seconds_bucket{service=\"user-service\"}[5m])),API latency affecting users
user_service_error_rate,impact,negative,metrics,sum(rate(http_requests_total{service=\"user-service\"\\,status=~\"5..\"}[1m])) / sum(rate(http_requests_total{service=\"user-service\"}[1m])),Error rate impacting reliability
user_service_cpu_usage,cause,negative,metrics,avg(rate(process_cpu_seconds_total{service=\"user-service\"}[5m])) * 100,CPU utilization
user_service_memory_usage,cause,negative,metrics,process_resident_memory_bytes{service=\"user-service\"} / 1024 / 1024,Memory usage in MB
user_service_db_query_latency,cause,negative,metrics,histogram_quantile(0.95\\, rate(db_query_duration_seconds_bucket{service=\"user-service\"}[5m])),Database query latency

# Import
POST /api/v1/kpi/defs/bulk-csv < kpis.csv

Step 2: Verify KPIs

GET /api/v1/kpi/defs?tags=user-service&limit=10

Step 3: Test Correlation

# Run correlation for last hour
POST /api/v1/unified/correlation
{
  "startTime": "2026-01-23T10:00:00Z",
  "endTime": "2026-01-23T11:00:00Z"
}

Step 4: Schedule Periodic RCA

Set up a cron job or monitoring system to:

# Run RCA every 15 minutes for the last 15 minutes
*/15 * * * * curl -X POST http://mirador-core:8010/api/v1/unified/rca \
  -H "Content-Type: application/json" \
  -d "{\"startTime\":\"$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)\",\"endTime\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}"

Workflow 3: Historical Incident Analysis

Scenario: Analyze pattern of failures over past week

Step 1: List Failures

POST /api/v1/unified/failures/list
{
  "limit": 100,
  "offset": 0,
  "filters": {
    "start_time": "2026-01-16T00:00:00Z",
    "end_time": "2026-01-23T00:00:00Z",
    "severity": "high"
  }
}

Step 2: Analyze Each Failure

# For each failure_uuid from step 1
POST /api/v1/unified/failures/get
{
  "id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}

Step 3: Run RCA for Each Incident

# For each incident time window
POST /api/v1/unified/rca
{
  "startTime": "<incident_start>",
  "endTime": "<incident_end>"
}

Step 4: Aggregate Patterns

Analyze RCA results to find:

  • Common root causes (e.g., kafka broker issues appearing in 80% of incidents)

  • Frequently affected services (e.g., payment-service)

  • Temporal patterns (e.g., failures cluster around 10 AM)


API Reference

Base URL

http://localhost:8010

Authentication

Mirador Core is designed to run behind an external API gateway or service mesh that handles authentication. No built-in authentication is required for API calls.

Content Type

All requests must include:

Content-Type: application/json

KPI Endpoints

Method

Endpoint

Description

GET

/api/v1/kpi/defs

List KPI definitions

POST

/api/v1/kpi/defs

Create/update KPI definition

GET

/api/v1/kpi/defs/{id}

Get single KPI definition

DELETE

/api/v1/kpi/defs/{id}

Delete KPI definition

POST

/api/v1/kpi/defs/bulk-json

Bulk import KPIs (JSON)

POST

/api/v1/kpi/defs/bulk-csv

Bulk import KPIs (CSV)

POST

/api/v1/kpi/search

Semantic search for KPIs

Failure Detection Endpoints

Method

Endpoint

Description

POST

/api/v1/unified/failures/detect

Detect component failures

POST

/api/v1/unified/failures/correlate

Correlate transaction failures

POST

/api/v1/unified/failures/list

List stored failures

POST

/api/v1/unified/failures/get

Get failure details

POST

/api/v1/unified/failures/delete

Delete failure record

Correlation Endpoints

Method

Endpoint

Description

POST

/api/v1/unified/correlation

Run correlation analysis

RCA Endpoints

Method

Endpoint

Description

POST

/api/v1/unified/rca

Compute root cause analysis

Internal Endpoints

Method

Endpoint

Description

GET

/health

Health check

GET

/ready

Readiness check

GET

/metrics

Prometheus metrics

GET

/api/v1/health

API v1 health

GET

/api/openapi.yaml

OpenAPI spec (YAML)

GET

/api/openapi.json

OpenAPI spec (JSON)

GET

/swagger/index.html

Swagger UI

Query Parameters

KPI List Filtering:

  • limit (int): Max results (default: 10, max: 10000)

  • offset (int): Pagination offset (default: 0)

  • tags (string[]): Filter by tags (comma-separated)

  • layer (string): Filter by layer (impact/cause)

  • sentiment (string): Filter by sentiment (positive/negative/neutral)

  • signalType (string): Filter by signal type (metrics/logs/traces/business/synthetic)

  • kind (string): Filter by kind (business/tech)

  • classifier (string): Filter by classifier

  • datastore (string): Filter by datastore

Example:

GET /api/v1/kpi/defs?layer=impact&sentiment=negative&limit=50&tags=critical,payments

Troubleshooting

Problem: KPIs Not Being Detected by Correlation

Symptoms:

  • Correlation returns empty results

  • RCA shows “No correlation data”

Diagnosis:

# 1. Check if KPIs exist
GET /api/v1/kpi/defs?limit=10

# 2. Check if KPIs have formulas
GET /api/v1/kpi/defs/{id}
# Verify "formula" or "query" field is populated

# 3. Test formula manually against VictoriaMetrics
curl "http://victoriametrics:8428/api/v1/query?query=<your_formula>&time=$(date +%s)"

Solutions:

  1. Ensure KPIs have valid formula or query fields

  2. Verify VictoriaMetrics contains data for the formula

  3. Check time window overlaps with actual data availability

Problem: Failure Detection Returns Empty Results

Symptoms:

  • /unified/failures/detect returns "total_incidents": 0

Diagnosis:

# 1. Check if traces exist
curl "http://victoriatraces:10428/api/v1/search?start=<start_epoch>&end=<end_epoch>&tags={}"

# 2. Check if metrics exist
curl "http://victoriametrics:8428/api/v1/query_range?query=up&start=<start>&end=<end>&step=60"

# 3. Verify time window format
# Must be RFC3339 UTC: "2026-01-23T10:00:00Z"

Solutions:

  1. Ensure traces are being ingested to VictoriaTraces

  2. Verify error spans have error=true tag

  3. Check anomaly metrics have iforest_is_anomaly=true label

  4. Confirm time window contains actual incident data

Problem: RCA Returns Low Confidence

Symptoms:

  • RCA score is 0 or very low

  • Chains are empty

  • Notes contain: “Correlation produced no candidates”

Diagnosis:

# 1. Run correlation first to see what it finds
POST /api/v1/unified/correlation
{
  "startTime": "...",
  "endTime": "..."
}

# 2. Check correlation response
# If empty, diagnose correlation (see above)

# 3. Verify KPIs have layer=impact and layer=cause defined
GET /api/v1/kpi/defs?layer=impact
GET /api/v1/kpi/defs?layer=cause

Solutions:

  1. Define at least one layer=impact KPI

  2. Define multiple layer=cause KPIs

  3. Ensure time window contains degradation events

  4. Run failure detection first to confirm incidents exist

Problem: Time Window Validation Errors

Symptoms:

  • 400 Bad Request: time window too small

  • 413 Payload Too Large: time window too large

Diagnosis:

# Check engine configuration
curl http://localhost:8010/api/v1/unified/metadata | jq '.engineConfig'

Solutions:

  1. Adjust window to respect min_window and max_window

  2. Default constraints:

    • min_window: 1m

    • max_window: 1h

  3. Update configs/config.yaml if constraints are too restrictive

Problem: Weaviate Connection Failures

Symptoms:

  • KPI creation fails with “weaviate unavailable”

  • Failure detection works but failures aren’t persisted

Diagnosis:

# 1. Check Weaviate health
curl http://weaviate:8080/v1/.well-known/ready

# 2. Check Mirador Core logs
docker logs mirador-core | grep -i weaviate

# 3. Verify Weaviate is enabled in config
cat configs/config.yaml | grep -A5 weaviate

Solutions:

  1. Ensure Weaviate container is running: docker ps | grep weaviate

  2. Verify network connectivity: docker network inspect mirador-net

  3. Set weaviate.enabled: true in config

  4. Restart Mirador Core: docker restart mirador-core

Problem: High Memory Usage

Symptoms:

  • Mirador Core container OOM killed

  • Slow response times

Diagnosis:

# Check container memory
docker stats mirador-core

# Check VictoriaMetrics data volume
docker exec victoriametrics du -sh /victoria-metrics-data

Solutions:

  1. Increase container memory limit in docker-compose.yml:

    deploy:
      resources:
        limits:
          memory: 4G
    
  2. Reduce default_query_limit in config.yaml

  3. Enable more aggressive caching (increase TTL)

  4. Reduce max_window to limit analysis scope

Problem: Correlation Takes Too Long

Symptoms:

  • Correlation requests timeout

  • High CPU usage during correlation

Diagnosis:

# Check number of KPIs
GET /api/v1/kpi/defs | jq '.total'

# Check time window size
# Large windows = more data to analyze

Solutions:

  1. Reduce time window (use 15m instead of 1h)

  2. Reduce number of active KPIs (archive unused ones)

  3. Increase timeout in config:

    database:
      victoria_metrics:
        timeout: 60000  # 60 seconds
    
  4. Scale Mirador Core horizontally (add more replicas)

Getting Help

Logs:

# Mirador Core logs
docker logs mirador-core --tail 100 -f

# All services logs
docker-compose logs -f

Health Checks:

# Overall health
curl http://localhost:8010/api/v1/health | jq

# Service status
curl http://localhost:8010/microservices/status | jq

Metrics:

# Prometheus metrics
curl http://localhost:8010/metrics | grep mirador

Support:

  • GitHub Issues: https://github.com/mirastacklabs-ai/mirador-core/issues

  • Documentation: http://localhost:8010/swagger/index.html


Summary

This guide covered the four core components of Mirador Core:

  1. KPIs: Define and manage metrics (foundation)

  2. Failures: Detect and track incidents

  3. Correlation: Analyze statistical relationships

  4. RCA: Explain root causes with 5-WHY methodology

Key Takeaways:

  • Always define KPIs before running failure detection or RCA

  • Use impact (layer=impact) and cause (layer=cause) KPIs for best results

  • Correlation provides statistical evidence; RCA adds causal reasoning

  • Time windows must respect configured min/max constraints

  • All components are interconnected and build upon each other

For complete API documentation, see the Swagger UI.