KPI, Failures, Correlation & RCA User Guide
Version: 10.0.0
Last Updated: January 2026
Target Audience: API Consumers & Integration Engineers
Overview
What This Guide Covers
This guide explains how to use Mirador Core’s four interconnected observability components:
┌──────────┐     ┌──────────┐     ┌─────────────┐     ┌──────┐
│   KPIs   │ --> │ Failures │ --> │ Correlation │ --> │ RCA  │
└──────────┘     └──────────┘     └─────────────┘     └──────┘
   Define           Detect            Analyze         Explain
  Metrics         Incidents          Patterns        Root Cause
Dependency Chain
Each component builds upon the previous:
KPIs (Key Performance Indicators): Define what metrics to monitor
Failures: Detect incidents based on KPI anomalies and error signals
Correlation: Perform statistical analysis to find relationships between KPIs
RCA (Root Cause Analysis): Use correlation data + 5 WHY methodology to identify root causes
Important: Without KPIs defined, Failure Detection will have limited effectiveness. Without Failures and KPIs, RCA cannot function.
Architecture
┌─────────────────────────────────────────────────────────────┐
│                      Mirador Core API                       │
│                      (localhost:8010)                       │
└─────────────────────────────────────────────────────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
       ┌───▼────┐       ┌──────▼───────┐      ┌────▼─────┐
       │  KPI   │       │ Correlation  │      │   RCA    │
       │  Repo  │       │    Engine    │      │  Engine  │
       └───┬────┘       └──────┬───────┘      └────┬─────┘
           │                   │                   │
           │           ┌───────▼────────┐          │
           │           │    Failure     │          │
           └───────────►   Detection    ◄──────────┘
                       └───────┬────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
      ┌────▼────┐       ┌──────▼───────┐     ┌─────▼─────┐
      │Victoria │       │   Victoria   │     │ Victoria  │
      │ Metrics │       │     Logs     │     │  Traces   │
      └─────────┘       └──────────────┘     └───────────┘
Prerequisites & Setup
System Requirements
Mirador Core:
Go 1.21+ (for building from source)
Docker & Docker Compose (for containerized deployment)
4GB RAM minimum (8GB recommended)
20GB disk space
Required Backend Services:
VictoriaMetrics (metrics storage)
VictoriaLogs (logs storage)
VictoriaTraces (traces storage)
Valkey/Redis (caching)
Weaviate (optional, for KPI vector storage)
Quick Start
Using Docker Compose (Recommended):
# 1. Clone repository
git clone https://github.com/mirastacklabs-ai/mirador-core
cd mirador-core
# 2. Start all services
make localdev-up
# 3. Wait for services to be ready (monitors health checks)
make localdev-wait
# 4. Verify Mirador Core is running
curl http://localhost:8010/api/v1/health
# 5. Seed sample KPIs (optional)
make localdev-seed-data
Expected Output:
{
"status": "healthy",
"timestamp": "2026-01-23T10:00:00Z",
"services": {
"mirador-core": "ok",
"victoriametrics": "ok",
"victorialogs": "ok",
"victoriatraces": "ok",
"valkey": "ok"
}
}
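A deployment script can gate on this response rather than reading it by eye. The sketch below assumes the response shape shown above (a top-level `status` plus a `services` map whose values are `"ok"` when healthy); the helper names `unhealthy_services` and `is_healthy` are illustrative and not part of Mirador Core.

```python
import json


def unhealthy_services(health: dict) -> list[str]:
    """Return the names of backend services whose state is not "ok"."""
    services = health.get("services", {})
    return sorted(name for name, state in services.items() if state != "ok")


def is_healthy(health: dict) -> bool:
    """Healthy means status == "healthy" and every listed service is "ok"."""
    return health.get("status") == "healthy" and not unhealthy_services(health)


# Hypothetical degraded response, used only to exercise the helpers offline.
sample = json.loads("""
{
  "status": "healthy",
  "services": {"mirador-core": "ok", "victoriametrics": "ok", "valkey": "down"}
}
""")
print(unhealthy_services(sample))
```

Checking the per-service map as well as the top-level status guards against a response that reports `healthy` while one backend is degraded.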
Access Points
Once running, you can access:
Mirador Core API: http://localhost:8010
Swagger UI: http://localhost:8010/swagger/index.html
OpenAPI Spec: http://localhost:8010/api/openapi.yaml
Health Check: http://localhost:8010/health
Prometheus Metrics: http://localhost:8010/metrics
Configuration & Deployment
Configuration File Structure
Mirador Core uses a YAML configuration file located at configs/config.yaml:
# Basic Settings
environment: production # or development
port: 8010
log_level: info
# VictoriaMetrics Ecosystem
database:
  victoria_metrics:
    endpoints:
      - "http://victoriametrics:8428"
    timeout: 30000
    cluster_mode: false
  victoria_logs:
    endpoints:
      - "http://victorialogs:9428"
    timeout: 30000
  victoria_traces:
    endpoints:
      - "http://victoriatraces:10428"
    timeout: 30000
# Caching (Valkey/Redis)
cache:
  nodes:
    - "valkey:6379"
  ttl: 300 # 5 minutes default
  password: "" # Set via env: CACHE_PASSWORD
  db: 0
# CORS (for frontend integration)
cors:
  allowed_origins:
    - "https://your-mirador-ui.com"
  allowed_methods:
    - "GET"
    - "POST"
    - "PUT"
    - "DELETE"
# Weaviate (optional - for KPI vector storage)
weaviate:
  enabled: true
  scheme: "http"
  host: "weaviate"
  port: 8080
  vectorizer:
    provider: "text2vec-transformers"
    model: "sentence-transformers/all-MiniLM-L6-v2"
    use_gpu: false
# Engine Configuration
engine:
  # Time window constraints
  min_window: 1m
  max_window: 1h
  # Payload validation
  strict_time_window_payload: true
  strict_time_window: true
  # Correlation settings
  correlation_threshold: 0.7
  default_graph_hops: 3
  default_max_whys: 5
  # Ring strategy for RCA
  ring_strategy: "default"
  # Query limits
  default_query_limit: 1000
Environment Variables
Override configuration via environment variables:
# Database credentials
export VM_PASSWORD="your-victoriametrics-password"
export VL_PASSWORD="your-victorialogs-password"
# Cache credentials
export CACHE_PASSWORD="your-valkey-password"
# Application settings
export PORT=8010
export LOG_LEVEL=info
export ENVIRONMENT=production
# Weaviate connection
export WEAVIATE_HOST=weaviate
export WEAVIATE_PORT=8080
Docker Deployment
Production docker-compose.yml:
version: '3.8'
services:
  mirador-core:
    image: miradorstack/mirador-core:latest
    container_name: mirador-core
    ports:
      - "8010:8010"
    environment:
      - ENVIRONMENT=production
      - LOG_LEVEL=info
      - VM_ENDPOINT=http://victoriametrics:8428
      - VL_ENDPOINT=http://victorialogs:9428
      - VT_ENDPOINT=http://victoriatraces:10428
      - CACHE_NODES=valkey:6379
      - CACHE_PASSWORD=${CACHE_PASSWORD}
      - WEAVIATE_HOST=weaviate
      - WEAVIATE_PORT=8080
    volumes:
      - ./configs:/app/configs:ro
    depends_on:
      - victoriametrics
      - victorialogs
      - victoriatraces
      - valkey
      - weaviate
    restart: unless-stopped
    networks:
      - mirador-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8010/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    container_name: victoriametrics
    ports:
      - "8428:8428"
    volumes:
      - vmdata:/victoria-metrics-data
    command:
      - --storageDataPath=/victoria-metrics-data
      - --search.maxUniqueTimeseries=2000000
      - --memory.allowedPercent=90
    networks:
      - mirador-net
    restart: unless-stopped
  victorialogs:
    image: victoriametrics/victoria-logs:latest
    container_name: victorialogs
    ports:
      - "9428:9428"
    volumes:
      - vldata:/victoria-logs-data
    command:
      - --storageDataPath=/victoria-logs-data
    networks:
      - mirador-net
    restart: unless-stopped
  victoriatraces:
    image: victoriametrics/victoria-traces:latest
    container_name: victoriatraces
    ports:
      - "10428:10428"
    volumes:
      - vtdata:/victoria-traces-data
    command:
      - --storageDataPath=/victoria-traces-data
    networks:
      - mirador-net
    restart: unless-stopped
  valkey:
    image: valkey/valkey:latest
    container_name: valkey
    ports:
      - "6379:6379"
    command: >
      valkey-server
      --requirepass ${CACHE_PASSWORD}
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
    volumes:
      - valkeydata:/data
    networks:
      - mirador-net
    restart: unless-stopped
  weaviate:
    image: semitechnologies/weaviate:latest
    container_name: weaviate
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
    volumes:
      - weaviatedata:/var/lib/weaviate
    networks:
      - mirador-net
    restart: unless-stopped
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    container_name: t2v-transformers
    environment:
      ENABLE_CUDA: '0'
    networks:
      - mirador-net
    restart: unless-stopped
networks:
  mirador-net:
    driver: bridge
volumes:
  vmdata:
  vldata:
  vtdata:
  valkeydata:
  weaviatedata:
Start production environment:
# Export required environment variables
export CACHE_PASSWORD="your-secure-password"
# Start services
docker-compose up -d
# Check logs
docker-compose logs -f mirador-core
# Verify health
curl http://localhost:8010/api/v1/health
Kubernetes Deployment
For Kubernetes deployments, see the deployments/k8s/ directory, which includes:
Deployment manifests
Service definitions
ConfigMaps and Secrets
Ingress configuration
Horizontal Pod Autoscaler (HPA)
Quick deploy:
# Apply all K8s resources
kubectl apply -f deployments/k8s/
# Check deployment status
kubectl get pods -n mirador
# Port-forward for local access
kubectl port-forward -n mirador svc/mirador-core 8010:8010
Component 1: KPI Management
What are KPIs?
Key Performance Indicators (KPIs) are the foundation of observability in Mirador Core. They define:
What metrics to monitor (e.g., API error rates, latency, transaction volume)
Where to find the data (metrics, logs, traces)
How to query it (formulas, query objects)
Business context (impact layer vs cause layer, sentiment, domain)
KPIs enable:
Registry-driven monitoring: Central source of truth for all monitored signals
Correlation discovery: Automatic relationship detection between metrics
RCA accuracy: Better root cause chains with pre-defined impact/cause layers
Natural language search: Vector-based semantic search over KPI descriptions
KPI Structure
A KPI definition contains:
{
"id": "kpi-uuid-123",
"name": "api_errors_total",
"kind": "tech",
"layer": "impact",
"signalType": "metrics",
"sentiment": "negative",
"classifier": "errors",
"datastore": "victoriametrics",
"queryType": "MetricsQL",
"unit": "count",
"format": "integer",
"formula": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
"definition": "Total API errors at the gateway per minute",
"businessImpact": "Revenue loss due to failed customer transactions",
"description": "Tracks API gateway errors - critical service health indicator",
"tags": ["api", "errors", "critical"],
"dataType": "timeseries",
"aggregationWindowHint": "1m",
"dimensionsHint": ["service.name", "region"],
"refreshInterval": 60,
"isShared": true,
"userId": "user-uuid-456"
}
Required Fields:
name: Unique identifier
layer: impact (business/user-facing) or cause (infrastructure/technical)
sentiment: positive, negative, or neutral
signalType: metrics, logs, traces, business, or synthetic
Optional But Recommended:
formula or query: How to fetch the data
businessImpact: Why this metric matters
tags: For filtering and organization
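Checking a definition locally before posting it can catch typos in enumerated fields early. The following is a minimal sketch built only from the field rules listed above; `validate_kpi` is a hypothetical client-side helper, and the server may enforce additional rules not described in this guide.

```python
REQUIRED_FIELDS = ("name", "layer", "sentiment", "signalType")
ALLOWED_VALUES = {
    "layer": {"impact", "cause"},
    "sentiment": {"positive", "negative", "neutral"},
    "signalType": {"metrics", "logs", "traces", "business", "synthetic"},
}


def validate_kpi(kpi: dict) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not kpi.get(field):
            problems.append(f"missing required field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        value = kpi.get(field)
        if value and value not in allowed:
            problems.append(f"invalid {field}: {value!r}")
    # formula/query are optional but recommended, so this is advisory only.
    if "formula" not in kpi and "query" not in kpi:
        problems.append("recommended: set formula or query")
    return problems


print(validate_kpi({"name": "x", "layer": "root", "sentiment": "negative", "signalType": "metrics"}))
```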
Creating KPIs
Single KPI Creation:
POST /api/v1/kpi/defs
Content-Type: application/json
{
"kpiDefinition": {
"name": "payment_processing_errors",
"kind": "business",
"layer": "impact",
"signalType": "metrics",
"sentiment": "negative",
"classifier": "errors",
"datastore": "victoriametrics",
"queryType": "MetricsQL",
"unit": "count",
"format": "integer",
"formula": "sum(rate(payment_errors_total[1m]))",
"definition": "Failed payment transactions per minute",
"businessImpact": "Direct revenue loss - each error = failed customer transaction",
"description": "Critical business KPI tracking payment processing health",
"tags": ["payments", "critical", "revenue"],
"dataType": "timeseries",
"aggregationWindowHint": "1m",
"dimensionsHint": ["payment_method", "region"],
"serviceFamily": "payments",
"domain": "transactions",
"refreshInterval": 60,
"isShared": true
}
}
Response (201 Created):
{
"status": "created",
"id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}
Using Query Object (Alternative to Formula):
{
"kpiDefinition": {
"name": "database_latency_p99",
"kind": "tech",
"layer": "cause",
"signalType": "metrics",
"sentiment": "negative",
"queryType": "MetricsQL",
"query": {
"metric": "db_query_duration_seconds",
"aggregation": "quantile",
"quantile": 0.99,
"window": "5m"
},
"unit": "seconds",
"format": "float",
"definition": "99th percentile database query latency",
"description": "Tracks database performance degradation",
"tags": ["database", "latency", "performance"]
}
}
Bulk KPI Import
JSON Bulk Import:
POST /api/v1/kpi/defs/bulk-json
Content-Type: application/json
{
"kpiDefinitions": [
{
"name": "api_latency_p95",
"layer": "impact",
"sentiment": "negative",
"signalType": "metrics",
"formula": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"description": "API 95th percentile latency"
},
{
"name": "kafka_consumer_lag",
"layer": "cause",
"sentiment": "negative",
"signalType": "metrics",
"formula": "sum(kafka_consumer_lag_max) by (consumer_group)",
"description": "Kafka consumer lag indicating processing delays"
}
]
}
CSV Bulk Import:
POST /api/v1/kpi/defs/bulk-csv
Content-Type: text/csv
name,layer,sentiment,signalType,formula,description
cpu_usage_percent,cause,negative,metrics,avg(rate(node_cpu_seconds_total[5m])) * 100,CPU utilization percentage
memory_usage_percent,cause,negative,metrics,100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes),Memory utilization percentage
disk_io_wait,cause,negative,metrics,rate(node_disk_io_time_seconds_total[5m]),Disk I/O wait time
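Formulas often contain commas (for example `histogram_quantile(0.95, ...)`), which would split into extra columns if written into CSV by hand. The sketch below generates the body with Python's `csv` module so such fields are quoted automatically. It assumes the bulk-csv endpoint accepts standard double-quoted CSV fields, which this guide does not state, so verify that against your deployment; `kpis_to_csv` is an illustrative name.

```python
import csv
import io

COLUMNS = ["name", "layer", "sentiment", "signalType", "formula", "description"]


def kpis_to_csv(kpis: list[dict]) -> str:
    """Serialize KPI dicts to CSV text, quoting any field that contains a comma."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kpis)
    return buffer.getvalue()


rows = [{
    "name": "api_latency_p95",
    "layer": "impact",
    "sentiment": "negative",
    "signalType": "metrics",
    "formula": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
    "description": "API 95th percentile latency",
}]
csv_body = kpis_to_csv(rows)
print(csv_body)
```

The resulting string can be sent as the request body with `Content-Type: text/csv`.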
Listing and Filtering KPIs
Get all KPIs (paginated):
GET /api/v1/kpi/defs?limit=10&offset=0
Filter by layer:
GET /api/v1/kpi/defs?layer=impact&limit=50
Filter by tags:
GET /api/v1/kpi/defs?tags=critical,payments
Filter by multiple criteria:
GET /api/v1/kpi/defs?layer=cause&sentiment=negative&signalType=metrics&classifier=latency
Response:
{
"kpiDefinitions": [
{
"id": "kpi-uuid-1",
"name": "api_errors_total",
"layer": "impact",
"sentiment": "negative",
"description": "Total API gateway errors"
}
],
"total": 150,
"nextOffset": 10
}
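To walk the whole registry, a client can follow `nextOffset` until the listing is exhausted. This is a sketch under assumptions: the guide does not say how the last page is signalled, so the loop stops on an empty page, a missing or non-advancing `nextOffset`, or a `nextOffset` at or past `total`. `fetch_page` is a hypothetical callable standing in for the HTTP call, which keeps the example runnable offline.

```python
from typing import Callable, Iterator


def iter_kpis(fetch_page: Callable[[int, int], dict], limit: int = 10) -> Iterator[dict]:
    """Yield every KPI by following nextOffset across pages."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page.get("kpiDefinitions", [])
        yield from items
        next_offset = page.get("nextOffset")
        total = page.get("total", 0)
        # Assumed stop conditions (not confirmed by the guide).
        if not items or next_offset is None or next_offset <= offset or next_offset >= total:
            return
        offset = next_offset


# Stand-in for GET /api/v1/kpi/defs?limit=...&offset=... with 25 fake KPIs.
_FAKE = [{"id": f"kpi-{i}"} for i in range(25)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {
        "kpiDefinitions": _FAKE[offset:offset + limit],
        "total": len(_FAKE),
        "nextOffset": offset + limit,
    }


all_kpis = list(iter_kpis(fake_fetch))
print(len(all_kpis))
```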
Searching KPIs (Natural Language)
Mirador Core supports vector-based semantic search over KPI descriptions:
POST /api/v1/kpi/search
Content-Type: application/json
{
"query": "payment transaction failures affecting revenue",
"limit": 5
}
Response:
{
"results": [
{
"id": "kpi-uuid-123",
"name": "payment_processing_errors",
"description": "Failed payment transactions per minute",
"score": 0.92
},
{
"id": "kpi-uuid-456",
"name": "transaction_timeout_total",
"description": "Payment transactions timing out",
"score": 0.85
}
]
}
Retrieving a Single KPI
GET /api/v1/kpi/defs/{id}
Example:
GET /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479
Response:
{
"id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"name": "payment_processing_errors",
"kind": "business",
"layer": "impact",
"signalType": "metrics",
"sentiment": "negative",
"formula": "sum(rate(payment_errors_total[1m]))",
"definition": "Failed payment transactions per minute",
"businessImpact": "Direct revenue loss",
"description": "Critical business KPI tracking payment processing health",
"tags": ["payments", "critical", "revenue"],
"createdAt": "2026-01-20T10:00:00Z",
"updatedAt": "2026-01-20T10:00:00Z"
}
Updating a KPI
Updates use the same endpoint as creation (upsert behavior). The example below changes the formula's rate window from 1m to 5m:
POST /api/v1/kpi/defs
Content-Type: application/json
{
"kpiDefinition": {
"id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"name": "payment_processing_errors",
"formula": "sum(rate(payment_errors_total[5m]))",
"description": "Updated: 5-minute aggregation window"
}
}
Response (200 OK):
{
"status": "ok",
"id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}
Deleting a KPI
DELETE /api/v1/kpi/defs/{id}
Example:
DELETE /api/v1/kpi/defs/f47ac10b-58cc-4372-a567-0e02b2c3d479
Response (200 OK):
{
"status": "deleted",
"id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}
Component 2: Failure Detection
What is Failure Detection?
Failure Detection analyzes telemetry data (metrics, logs, traces) within a time window to identify component failures. It:
Detects error spans: Traces with the error=true tag
Identifies anomalies: Metrics with iforest_is_anomaly=true (from AI anomaly detection)
Groups by service+component: Aggregates failures per component
Generates unique IDs: Deterministic UUIDs for deduplication
Persists to storage: Saves failures to Weaviate for historical analysis
When to Use Failure Detection
Incident investigation: “What failed between 10:00 and 11:00?”
Post-mortem analysis: Identify all affected components during an outage
Pre-RCA preparation: Gather failure candidates before running RCA
Monitoring dashboards: Track failure trends over time
Detecting Failures
Basic Detection (All Components):
POST /api/v1/unified/failures/detect
Content-Type: application/json
{
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
}
}
Filtered Detection (Specific Components):
POST /api/v1/unified/failures/detect
Content-Type: application/json
{
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"components": ["kafka", "cassandra", "api-gateway"],
"services": ["payment-service", "auth-service"]
}
Response:
{
"incidents": [
{
"incident_id": "incident_kafka_1737626400",
"failure_id": "kafka-producer-kafka-20260123-100000",
"failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T10:15:00Z"
},
"primary_component": "kafka",
"affected_transaction_ids": ["txn-123", "txn-456"],
"services_involved": ["payment-service", "order-service"],
"failure_mode": "error_spans",
"confidence": 0.92,
"severity": "high"
}
],
"summary": {
"total_incidents": 3,
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"service_component_summaries": [
{
"service": "payment-service",
"component": "kafka",
"failure_id": "kafka-producer-kafka-20260123-100000",
"failure_uuid": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
"failure_count": 42,
"affected_transactions": 15,
"average_anomaly_score": 0.87,
"average_confidence": 0.92,
"error_spans_count": 38,
"error_metrics_count": 4,
"last_failure_timestamp": "2026-01-23T10:14:32Z"
}
],
"metrics_error_summary": {
"total_error_metrics": 12,
"total_anomaly_metrics": 8,
"error_metrics_by_name": [
{
"metric_name": "kafka_producer_errors_total",
"count": 8,
"average_value": 3.5,
"last_timestamp": "2026-01-23T10:14:32Z"
}
],
"anomaly_metrics_by_name": [
{
"metric_name": "kafka_producer_latency_ms",
"count": 5,
"average_value": 1250.3,
"last_timestamp": "2026-01-23T10:14:28Z"
}
]
}
}
}
Understanding Failure IDs
Each failure gets two identifiers:
failure_id (human-readable):
Format: {service}-{component}-{YYYYMMDD-HHMMSS}
Example: kafka-producer-kafka-20260123-100000
Easy to read in logs and dashboards
failure_uuid (deterministic UUID v5):
Example: e458d90f-f525-58a9-9e92-9f91faa73cf2
Unique identifier for storage and deduplication
Generated from: service + component + timestamp
Same failure detected multiple times = same UUID
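The deduplication property follows from how UUID v5 works: it hashes a namespace and a name, so identical inputs always yield the identical UUID. The sketch below illustrates this with Python's standard `uuid` module. The namespace (`FAILURE_NAMESPACE`) and the choice of the `failure_id` string as the hashed name are assumptions made for illustration; Mirador Core's actual inputs are not specified here, so the UUIDs produced will not match the server's.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical namespace; the namespace Mirador Core uses is not documented in this guide.
FAILURE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "failures.example.invalid")


def failure_ids(service: str, component: str, detected_at: datetime) -> tuple[str, uuid.UUID]:
    """Build a readable failure_id and a deterministic UUID v5 derived from it."""
    stamp = detected_at.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    failure_id = f"{service}-{component}-{stamp}"
    return failure_id, uuid.uuid5(FAILURE_NAMESPACE, failure_id)


when = datetime(2026, 1, 23, 10, 0, 0, tzinfo=timezone.utc)
first = failure_ids("kafka-producer", "kafka", when)
second = failure_ids("kafka-producer", "kafka", when)
print(first[0])
print(first[1] == second[1], first[1].version)
```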
Transaction Failure Correlation
Correlate failures for specific transaction IDs to track cascading failures:
POST /api/v1/unified/failures/correlate
Content-Type: application/json
{
"transactionIDs": ["txn-12345", "txn-67890"],
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
}
}
Response:
{
"incidents": [
{
"incident_id": "incident_txn_12345",
"failure_id": "transaction-correlation-20260123-100000",
"affected_transaction_ids": ["txn-12345"],
"services_involved": ["api-gateway", "payment-service", "kafka"],
"failure_sequence": [
{
"service": "api-gateway",
"component": "http-server",
"timestamp": "2026-01-23T10:00:05Z",
"error": "timeout waiting for payment-service"
},
{
"service": "payment-service",
"component": "kafka-producer",
"timestamp": "2026-01-23T10:00:04Z",
"error": "kafka broker unavailable"
}
],
"root_component": "kafka",
"confidence": 0.95
}
]
}
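Note that `failure_sequence` in the sample response is not in chronological order: the payment-service entry (10:00:04) precedes the api-gateway entry (10:00:05). When tracing a cascade, sorting by timestamp makes the earliest failure explicit. A minimal sketch, assuming timestamps are always RFC 3339 strings with a trailing `Z` as in the example:

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def in_time_order(failure_sequence: list[dict]) -> list[dict]:
    """Return the failure steps sorted from earliest to latest."""
    return sorted(failure_sequence, key=lambda step: parse_ts(step["timestamp"]))


sequence = [
    {"service": "api-gateway", "component": "http-server", "timestamp": "2026-01-23T10:00:05Z"},
    {"service": "payment-service", "component": "kafka-producer", "timestamp": "2026-01-23T10:00:04Z"},
]
earliest = in_time_order(sequence)[0]
print(earliest["service"], earliest["component"])
```

Here the earliest entry is the kafka-producer error in payment-service, which is consistent with the `root_component` of `kafka` reported in the response.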
Listing Stored Failures
Retrieve a paginated list of historical failures:
POST /api/v1/unified/failures/list
Content-Type: application/json
{
"limit": 20,
"offset": 0,
"filters": {
"severity": "high",
"component": "kafka"
}
}
Response:
{
"failures": [
{
"id": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
"failure_id": "kafka-producer-kafka-20260123-100000",
"summary": "Kafka producer failures in payment-service",
"severity": "high",
"timestamp": "2026-01-23T10:00:00Z",
"component": "kafka",
"service": "payment-service"
}
],
"total": 150,
"nextOffset": 20
}
Getting Failure Details
Retrieve the full failure record with all signals and metadata:
POST /api/v1/unified/failures/get
Content-Type: application/json
{
"id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}
Response:
{
"failure": {
"id": "e458d90f-f525-58a9-9e92-9f91faa73cf2",
"failure_id": "kafka-producer-kafka-20260123-100000",
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T10:15:00Z"
},
"primary_component": "kafka",
"services": "payment-service",
"affected_transaction_count": 15,
"error_signals": [
{
"timestamp": "2026-01-23T10:00:05Z",
"service": "payment-service",
"component": "kafka-producer",
"error_message": "broker unavailable",
"trace_id": "abc123",
"span_id": "def456"
}
],
"anomaly_signals": [
{
"timestamp": "2026-01-23T10:00:03Z",
"metric_name": "kafka_producer_latency_ms",
"value": 1250.3,
"is_anomaly": true,
"anomaly_score": 0.92
}
],
"metadata": {
"detection_timestamp": "2026-01-23T10:16:00Z",
"detection_confidence": 0.92,
"total_signals": 42
}
}
}
Deleting a Failure
Remove a failure record from storage:
POST /api/v1/unified/failures/delete
Content-Type: application/json
{
"id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}
Response:
{
"status": "deleted",
"id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}
Component 3: Correlation Analysis
What is Correlation Analysis?
Correlation Analysis performs statistical analysis between KPIs to discover relationships and patterns. It:
Builds temporal rings: Divides time window into rings (R1: immediate, R2: short, R3: medium, R4: long)
Discovers impact KPIs: Identifies metrics showing degradation (red anchors)
Finds candidate causes: Detects correlated metrics that may explain the impact
Computes statistics: Pearson, Spearman, cross-correlation, partial correlation
Scores candidates: Assigns suspicion scores based on statistical strength
Correlation vs RCA
| Aspect | Correlation | RCA |
|---|---|---|
| Purpose | Find statistical relationships | Explain root cause |
| Output | List of correlated KPIs with scores | 5-WHY chains with narrative |
| Method | Statistical analysis | Correlation + reasoning |
| Use Case | “What else changed?” | “Why did it fail?” |
Key Insight: RCA uses Correlation results internally, then adds causal reasoning.
Running Correlation Analysis
Time-Window Correlation (Recommended):
POST /api/v1/unified/correlation
Content-Type: application/json
{
"startTime": "2026-01-23T10:00:00Z",
"endTime": "2026-01-23T11:00:00Z"
}
Response:
{
"status": "success",
"result": {
"correlationID": "corr_1737626400",
"timeRange": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"rings": {
"R1_IMMEDIATE": {
"label": "R1_IMMEDIATE",
"description": "Anomalies very close to the peak",
"duration": "5s",
"start": "2026-01-23T10:55:55Z",
"end": "2026-01-23T10:56:00Z"
},
"R2_SHORT": {
"label": "R2_SHORT",
"description": "Anomalies shortly before peak",
"duration": "30s",
"start": "2026-01-23T10:55:25Z",
"end": "2026-01-23T10:55:55Z"
},
"R3_MEDIUM": {
"label": "R3_MEDIUM",
"description": "Anomalies moderately before peak",
"duration": "2m",
"start": "2026-01-23T10:53:25Z",
"end": "2026-01-23T10:55:25Z"
},
"R4_LONG": {
"label": "R4_LONG",
"description": "Anomalies further back",
"duration": "10m",
"start": "2026-01-23T10:43:25Z",
"end": "2026-01-23T10:53:25Z"
}
},
"affectedServices": ["payment-service", "kafka"],
"redAnchors": [
{
"service": "payment-service",
"metric": "payment_processing_errors",
"score": 0.95,
"ring": "R1_IMMEDIATE",
"labelFingerprint": {
"service.name": "payment-service",
"region": "us-east-1"
}
}
],
"causes": [
{
"kpi": "kafka_producer_latency_ms",
"service": "payment-service",
"suspicionScore": 0.89,
"ring": "R2_SHORT",
"reasons": [
"high_pearson_correlation",
"high_spearman_correlation",
"temporal_precedence",
"high_anomaly_density"
],
"stats": {
"pearson": 0.87,
"spearman": 0.91,
"crossCorrMax": 0.88,
"crossCorrLag": -2,
"partial": 0.82
},
"labelFingerprint": {
"service.name": "payment-service",
"component": "kafka-producer"
}
},
{
"kpi": "kafka_broker_connection_errors",
"service": "kafka",
"suspicionScore": 0.92,
"ring": "R3_MEDIUM",
"reasons": [
"high_pearson_correlation",
"temporal_precedence",
"upstream_component"
],
"stats": {
"pearson": 0.93,
"spearman": 0.89,
"crossCorrMax": 0.91,
"crossCorrLag": -5,
"partial": 0.85
}
}
],
"confidence": 0.91,
"createdAt": "2026-01-23T11:01:00Z"
}
}
Understanding Correlation Results
Red Anchors:
Metrics showing impact (business/user-facing degradation)
Typically from KPIs with layer=impact
High scores = strong impact signal
Cause Candidates:
Metrics that correlate with red anchors
Typically from KPIs with layer=cause
Ranked by suspicion score (higher = more suspicious)
Statistics Explained:
| Metric | Meaning | Range / Unit |
|---|---|---|
| Pearson | Linear correlation strength | -1 to +1 |
| Spearman | Rank correlation (monotonic relationship) | -1 to +1 |
| CrossCorrMax | Maximum cross-correlation | -1 to +1 |
| CrossCorrLag | Time lag (negative = cause precedes impact) | seconds |
| Partial | Correlation after removing confounders | -1 to +1 |
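To make the Pearson and lag columns concrete, the sketch below computes a Pearson coefficient in plain Python and then searches for the shift at which a candidate cause best matches the impact series. It is an illustration of the statistics, not Mirador Core's implementation. The lag here is expressed in samples (multiply by the sampling step to get seconds), and the sign convention follows the table: negative means the cause leads.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(cause: list[float], impact: list[float], max_shift: int) -> tuple[int, float]:
    """Return (lag, correlation) with the highest correlation; negative lag = cause leads."""
    best = (0, pearson(cause, impact))
    for shift in range(1, max_shift + 1):
        # Cause leads: align cause[t] with impact[t + shift].
        r = pearson(cause[:-shift], impact[shift:])
        if r > best[1]:
            best = (-shift, r)
        # Impact leads: align cause[t + shift] with impact[t].
        r = pearson(cause[shift:], impact[:-shift])
        if r > best[1]:
            best = (shift, r)
    return best


cause = [0, 0, 1, 5, 9, 4, 1, 0, 0, 0, 0, 0]
impact = [0, 0, 0, 0, 1, 5, 9, 4, 1, 0, 0, 0]  # same shape, two samples later
lag, corr = best_lag(cause, impact, max_shift=4)
print(lag, round(corr, 3))
```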
Suspicion Score Calculation:
suspicionScore = weighted_average(
pearson_correlation,
spearman_correlation,
cross_correlation_max,
partial_correlation,
anomaly_density,
temporal_precedence
)
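The formula above can be sketched as an ordinary weighted average. The weights below are hypothetical, since this guide does not publish the ones Mirador Core uses. Taking the magnitude of each correlation and treating `temporalPrecedence` as a 0 or 1 signal are likewise assumptions made so that all inputs fall in the 0 to 1 range.

```python
# Hypothetical weights; the real weighting is not given in this guide.
WEIGHTS = {
    "pearson": 0.25,
    "spearman": 0.20,
    "crossCorrMax": 0.20,
    "partial": 0.15,
    "anomalyDensity": 0.10,
    "temporalPrecedence": 0.10,
}


def suspicion_score(signals: dict[str, float]) -> float:
    """Weighted average of the available signals; weights are renormalized if some are missing."""
    present = {name: weight for name, weight in WEIGHTS.items() if name in signals}
    total_weight = sum(present.values())
    if total_weight == 0:
        return 0.0
    # abs() is an assumption: a strong negative correlation counts as strong evidence too.
    weighted = sum(abs(signals[name]) * weight for name, weight in present.items())
    return weighted / total_weight


score = suspicion_score({
    "pearson": 0.87, "spearman": 0.91, "crossCorrMax": 0.88,
    "partial": 0.82, "anomalyDensity": 1.0, "temporalPrecedence": 1.0,
})
print(f"{score:.2f}")
```

With different weights the result changes, so treat the printed value as an illustration of the arithmetic rather than a reproduction of the scores in the sample response.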
Reasons (Why This Candidate is Suspicious):
high_pearson_correlation: Strong linear relationship
high_spearman_correlation: Strong monotonic relationship
temporal_precedence: Cause occurred before impact
high_anomaly_density: Many anomalies in this metric
upstream_component: Component is upstream in service graph
Time Window Constraints
Correlation enforces time window limits from configuration:
engine:
  min_window: 1m # Minimum analysis window
  max_window: 1h # Maximum analysis window
Invalid Windows:
# Too small
{"startTime": "2026-01-23T10:00:00Z", "endTime": "2026-01-23T10:00:30Z"}
# Error: time window too small: 30s < minWindow 1m
# Too large
{"startTime": "2026-01-23T00:00:00Z", "endTime": "2026-01-23T23:59:59Z"}
# Error: time window too large: 23h59m59s > maxWindow 1h
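Clients can reject out-of-range windows before making a request. The sketch below mirrors the default `min_window` and `max_window` values shown above. Whether the server treats the bounds as inclusive is an assumption (here a window of exactly 1h is accepted), and the message text is only modelled loosely on the errors above.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_WINDOW = timedelta(minutes=1)  # engine.min_window
MAX_WINDOW = timedelta(hours=1)    # engine.max_window


def parse_ts(value: str) -> datetime:
    # Normalize the trailing "Z" so this also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_window(start_time: str, end_time: str) -> Optional[str]:
    """Return None if the window is acceptable, otherwise a short reason."""
    window = parse_ts(end_time) - parse_ts(start_time)
    if window <= timedelta(0):
        return "endTime must be after startTime"
    if window < MIN_WINDOW:
        return f"time window too small: {window} < {MIN_WINDOW}"
    if window > MAX_WINDOW:
        return f"time window too large: {window} > {MAX_WINDOW}"
    return None


print(check_window("2026-01-23T10:00:00Z", "2026-01-23T10:00:30Z"))
```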
Component 4: Root Cause Analysis (RCA)
What is RCA?
Root Cause Analysis (RCA) combines:
Correlation results (statistical relationships)
5 WHY methodology (iterative questioning)
Service topology (upstream/downstream relationships)
Temporal rings (time-based evidence)
Together, these inputs produce human-readable explanations of why incidents occurred.
RCA Process Flow
1. User provides time window
↓
2. RCA engine calls Correlation engine
↓
3. Correlation returns:
- Red anchors (impacts)
- Cause candidates (correlated metrics)
- Statistical evidence
↓
4. RCA builds 5-WHY chains:
- WHY 1: Business impact (what failed?)
- WHY 2: Entry service degradation
- WHY 3-5: Upstream causes (evidence-driven)
↓
5. Returns structured narrative with:
- Impact summary
- Causal chains
- Time rings
- Diagnostic details
Running RCA
Basic RCA Request:
POST /api/v1/unified/rca
Content-Type: application/json
{
"startTime": "2026-01-23T10:00:00Z",
"endTime": "2026-01-23T11:00:00Z"
}
Response:
{
"status": "success",
"data": {
"impact": {
"id": "incident_payment_service_1737626400",
"impactService": "payment-service",
"metricName": "payment_processing_errors",
"timeStart": "2026-01-23T10:00:00Z",
"timeEnd": "2026-01-23T11:00:00Z",
"impactSummary": "Impact detected on payment-service (correlation confidence 0.91). Top-candidate kafka_producer_latency_ms: pearson=0.87 spearman=0.91 partial=0.82 cross_max=0.88 lag=-2 anomalies=HIGH",
"severity": 0.91
},
"chains": [
{
"steps": [
{
"why": 1,
"service": "payment-service",
"kpiName": "payment_processing_errors",
"timeRange": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"ring": "R1_IMMEDIATE",
"direction": "SAME",
"score": 0.91,
"evidence": [
{
"type": "red_anchor",
"key": "payment-service",
"value": "anchor_score=0.950"
}
],
"summary": "Payment processing failed: payment_processing_errors increased dramatically in R1_IMMEDIATE (0.91 confidence)"
},
{
"why": 2,
"service": "payment-service",
"kpiName": "payment_processing_errors",
"timeRange": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"ring": "R1_IMMEDIATE",
"direction": "UPSTREAM",
"score": 0.95,
"evidence": [
{
"type": "red_anchor",
"key": "payment-service",
"value": "metric=payment_processing_errors score=0.950"
}
],
"summary": "payment-service degraded: payment_processing_errors showed anomalies in R1_IMMEDIATE (0.95 confidence)"
},
{
"why": 3,
"service": "payment-service",
"kpiName": "kafka_producer_latency_ms",
"timeRange": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"ring": "R2_SHORT",
"direction": "UPSTREAM",
"score": 0.89,
"evidence": [
{
"type": "correlation_stats",
"key": "kafka_producer_latency_ms",
"value": "pearson=0.87 spearman=0.91 cross_lag=-2 suspicion=0.89"
},
{
"type": "correlation_reason",
"key": "kafka_producer_latency_ms",
"value": "high_pearson_correlation"
},
{
"type": "correlation_reason",
"key": "kafka_producer_latency_ms",
"value": "temporal_precedence"
}
],
"summary": "Upstream component kafka_producer_latency_ms caused payment-service degradation (0.89 suspicion, pearson=0.87)"
},
{
"why": 4,
"service": "kafka",
"kpiName": "kafka_broker_connection_errors",
"timeRange": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
},
"ring": "R2_SHORT",
"direction": "UPSTREAM",
"score": 0.92,
"evidence": [
{
"type": "correlation_stats",
"key": "kafka_broker_connection_errors",
"value": "pearson=0.93 spearman=0.89 cross_lag=-5 suspicion=0.92"
},
{
"type": "correlation_reason",
"key": "kafka_broker_connection_errors",
"value": "high_pearson_correlation"
},
{
"type": "correlation_reason",
"key": "kafka_broker_connection_errors",
"value": "upstream_component"
}
],
"summary": "Upstream component kafka_broker_connection_errors caused payment-service degradation (0.92 suspicion, pearson=0.93)"
}
],
"score": 0.91,
"confidence": 0.91
}
],
"generatedAt": "2026-01-23T11:01:30Z",
"score": 0.91,
"notes": [],
"diagnostics": {},
"timeRings": {
"definitions": {
"R1_IMMEDIATE": {
"label": "R1_IMMEDIATE",
"description": "Anomalies very close to the peak",
"duration": "5s"
},
"R2_SHORT": {
"label": "R2_SHORT",
"description": "Anomalies shortly before peak",
"duration": "30s"
},
"R3_MEDIUM": {
"label": "R3_MEDIUM",
"description": "Anomalies moderately before peak",
"duration": "2m"
},
"R4_LONG": {
"label": "R4_LONG",
"description": "Anomalies further back",
"duration": "10m"
}
},
"perChain": []
}
},
"timestamp": "2026-01-23T11:01:30Z"
}
Understanding RCA Output
Impact Section:
impactService: Which service was affected
metricName: What metric degraded
impactSummary: Human-readable summary with statistical evidence
severity: Confidence score (0-1)
Chains (5-WHY Chains):
Each chain represents one possible root cause path:
Step Structure:
why: Step number (1-5)
service: Service involved at this step
kpiName: Metric/KPI at this step
ring: Temporal ring (when it occurred)
direction: SAME (impact), UPSTREAM (cause), DOWNSTREAM (effect)
score: Confidence for this step
evidence: Statistical and correlation evidence
summary: Human-readable explanation
Chain Scoring:
Weighted average: earlier steps (WHY 1-2) weighted higher
Higher score = more confident root cause path
Multiple chains = multiple possible root causes (sorted by score)
Time Rings:
R1_IMMEDIATE (5s): Events very close to peak
R2_SHORT (30s): Events shortly before peak
R3_MEDIUM (2m): Events moderately before peak
R4_LONG (10m): Events further back
Rings help identify temporal ordering (cause precedes effect).
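One way to picture the rings is as contiguous intervals laid end to end, walking backwards from the peak. The sketch below computes such boundaries from the four durations listed above. Contiguity and anchoring at the peak are assumptions about the `default` ring strategy, and the peak time used here is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

RING_DURATIONS = [
    ("R1_IMMEDIATE", timedelta(seconds=5)),
    ("R2_SHORT", timedelta(seconds=30)),
    ("R3_MEDIUM", timedelta(minutes=2)),
    ("R4_LONG", timedelta(minutes=10)),
]


def ring_bounds(peak: datetime) -> dict[str, tuple[datetime, datetime]]:
    """Return {label: (start, end)} with each ring ending where the previous one starts."""
    bounds = {}
    end = peak
    for label, duration in RING_DURATIONS:
        start = end - duration
        bounds[label] = (start, end)
        end = start
    return bounds


peak = datetime(2026, 1, 23, 10, 56, 0, tzinfo=timezone.utc)  # illustrative peak
rings = ring_bounds(peak)
for label, (start, end) in rings.items():
    print(label, start.strftime("%H:%M:%S"), end.strftime("%H:%M:%S"))
```

A cause KPI whose anomaly falls in an outer ring (for example R3_MEDIUM) happened earlier than an impact seen in R1_IMMEDIATE, which is the temporal ordering the RCA chain relies on.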
Interpreting a 5-WHY Chain
Example chain interpretation:
WHY 1 (Business Impact): "Payment processing failed"
→ What the user experienced
→ Business/revenue impact
WHY 2 (Entry Service): "payment-service degraded"
→ Which service exhibited the problem
→ Where the impact manifested
WHY 3 (Direct Cause): "kafka_producer_latency_ms increased"
→ Immediate technical cause
→ Component directly affecting service
WHY 4 (Upstream Cause): "kafka_broker_connection_errors occurred"
→ Root infrastructure issue
→ What actually triggered the cascade
WHY 5 (Optional): Further upstream causes if available
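For dashboards or chat notifications, a chain can be flattened into one line per step. The sketch below assumes the step fields shown in the RCA response (`why`, `service`, `kpiName`, `ring`, `score`); `describe_chain` and `deepest_step` are hypothetical helper names, and treating the highest-numbered WHY as the root-cause candidate follows the interpretation above.

```python
from typing import Optional


def describe_chain(chain: dict) -> list[str]:
    """Render each step as 'WHY n [ring] service / kpi (score x.xx)' in step order."""
    steps = sorted(chain.get("steps", []), key=lambda step: step["why"])
    return [
        f"WHY {s['why']} [{s['ring']}] {s['service']} / {s['kpiName']} (score {s['score']:.2f})"
        for s in steps
    ]


def deepest_step(chain: dict) -> Optional[dict]:
    """Return the highest-numbered WHY step, the chain's root-cause candidate."""
    steps = chain.get("steps", [])
    return max(steps, key=lambda step: step["why"]) if steps else None


# Two steps taken from the sample response, deliberately out of order.
chain = {"steps": [
    {"why": 4, "service": "kafka", "kpiName": "kafka_broker_connection_errors",
     "ring": "R2_SHORT", "score": 0.92},
    {"why": 1, "service": "payment-service", "kpiName": "payment_processing_errors",
     "ring": "R1_IMMEDIATE", "score": 0.91},
]}
for line in describe_chain(chain):
    print(line)
```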
Low Confidence RCA
If no correlation data is available:
Request:
{
"startTime": "2026-01-01T00:00:00Z",
"endTime": "2026-01-01T00:05:00Z"
}
Response:
{
"status": "success",
"data": {
"impact": {
"id": "incident_unknown",
"impactService": "unknown",
"metricName": "unknown",
"impactSummary": "No correlation data for window 2026-01-01 00:00:00 +0000 UTC - 2026-01-01 00:05:00 +0000 UTC",
"severity": 0
},
"chains": [],
"score": 0,
"notes": ["Correlation produced no candidates; returning low-confidence RCA"]
}
}
This indicates one of the following:
No KPIs were found with degradation in this window
The KPI registry is empty
VictoriaMetrics/Logs/Traces have no data for this period
Resolution:
Verify KPIs are defined
Check time window contains actual incidents
Ensure telemetry data exists in VictoriaMetrics/Logs/Traces
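Note that the low-confidence response still reports "status": "success", so a client cannot rely on the status field alone. A minimal client-side check, assuming the response shape shown above (the helper name is hypothetical):

```python
import json


def is_low_confidence(response):
    """True when the RCA response carries no chains and a zero score."""
    data = response.get("data", {})
    return not data.get("chains") and data.get("score", 0) == 0


# Trimmed copy of the low-confidence response shown in this guide.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "impact": {"id": "incident_unknown", "impactService": "unknown"},
    "chains": [],
    "score": 0,
    "notes": ["Correlation produced no candidates; returning low-confidence RCA"]
  }
}
""")
```

Calling is_low_confidence(sample) returns True; a caller would typically fall back to the resolution steps above rather than act on the result.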
Complete Workflows
Workflow 1: Full Incident Investigation
Scenario: Payment processing outage on January 23, 2026, between 10:00 and 11:00 UTC
Step 1: Define KPIs (if not already done)
# Define impact KPI
POST /api/v1/kpi/defs
{
"kpiDefinition": {
"name": "payment_processing_errors",
"layer": "impact",
"sentiment": "negative",
"signalType": "metrics",
"formula": "sum(rate(payment_errors_total[1m]))",
"description": "Failed payment transactions"
}
}
# Define cause KPIs
POST /api/v1/kpi/defs
{
"kpiDefinition": {
"name": "kafka_producer_latency_ms",
"layer": "cause",
"sentiment": "negative",
"signalType": "metrics",
"formula": "histogram_quantile(0.99, kafka_producer_latency_bucket)",
"description": "Kafka producer p99 latency"
}
}
POST /api/v1/kpi/defs
{
"kpiDefinition": {
"name": "database_connection_pool_exhausted",
"layer": "cause",
"sentiment": "negative",
"signalType": "metrics",
"formula": "db_pool_active / db_pool_max > 0.95",
"description": "Database connection pool near capacity"
}
}
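When many definitions are created from scripts, it can help to build the request body in one place and catch typos before sending. The sketch below is one way to do that in Python. The allowed values are taken from the KPI list filters described in this guide; whether the server rejects other values is an assumption, and the helper name is hypothetical.

```python
# Allowed values mirror the KPI list filters in this guide. Whether the
# server itself rejects other values is an assumption of this sketch.
LAYERS = {"impact", "cause"}
SENTIMENTS = {"positive", "negative", "neutral"}
SIGNAL_TYPES = {"metrics", "logs", "traces", "business", "synthetic"}


def kpi_definition(name, layer, sentiment, signal_type, formula, description=""):
    """Build the JSON body for POST /api/v1/kpi/defs with local validation."""
    if layer not in LAYERS:
        raise ValueError(f"layer must be one of {sorted(LAYERS)}")
    if sentiment not in SENTIMENTS:
        raise ValueError(f"sentiment must be one of {sorted(SENTIMENTS)}")
    if signal_type not in SIGNAL_TYPES:
        raise ValueError(f"signalType must be one of {sorted(SIGNAL_TYPES)}")
    return {
        "kpiDefinition": {
            "name": name,
            "layer": layer,
            "sentiment": sentiment,
            "signalType": signal_type,
            "formula": formula,
            "description": description,
        }
    }
```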
Step 2: Detect Failures
POST /api/v1/unified/failures/detect
{
"time_range": {
"start": "2026-01-23T10:00:00Z",
"end": "2026-01-23T11:00:00Z"
}
}
Result: Identified failures in:
payment-service (kafka component)
database-service (connection pool)
api-gateway (timeouts)
Step 3: Run Correlation Analysis
POST /api/v1/unified/correlation
{
"startTime": "2026-01-23T10:00:00Z",
"endTime": "2026-01-23T11:00:00Z"
}
Result: Correlation found:
payment_processing_errors (impact)
kafka_producer_latency_ms (cause, suspicion=0.89)
database_connection_pool_exhausted (cause, suspicion=0.75)
Step 4: Run RCA
POST /api/v1/unified/rca
{
"startTime": "2026-01-23T10:00:00Z",
"endTime": "2026-01-23T11:00:00Z"
}
Result: RCA chain:
WHY 1: Payment processing failed (business impact)
WHY 2: payment-service degraded (errors increased)
WHY 3: kafka_producer_latency_ms spiked (direct cause)
WHY 4: kafka_broker_connection_errors occurred (root cause)
Conclusion: Kafka broker connection failures caused producer latency, leading to payment processing errors.
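One detail that is easy to miss in Steps 2-4: failure detection nests the window under time_range with start and end keys, while correlation and RCA take startTime and endTime at the top level. The sketch below builds the three request bodies from a single window so the shapes stay consistent. It only constructs data; sending the requests is left to the caller, and the function name is hypothetical.

```python
def investigation_requests(start, end):
    """Return (method, path, body) tuples for Steps 2-4 of this workflow."""
    window = {"startTime": start, "endTime": end}
    return [
        # Failure detection nests the window under "time_range".
        ("POST", "/api/v1/unified/failures/detect",
         {"time_range": {"start": start, "end": end}}),
        # Correlation and RCA take startTime/endTime at the top level.
        ("POST", "/api/v1/unified/correlation", dict(window)),
        ("POST", "/api/v1/unified/rca", dict(window)),
    ]
```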
Workflow 2: Proactive Monitoring Setup
Scenario: Set up monitoring for a new microservice
Step 1: Bulk Import KPIs
# Create kpis.csv
name,layer,sentiment,signalType,formula,description
user_service_api_latency_p99,impact,negative,metrics,histogram_quantile(0.99\\, rate(http_request_duration_seconds_bucket{service=\"user-service\"}[5m])),API latency affecting users
user_service_error_rate,impact,negative,metrics,sum(rate(http_requests_total{service=\"user-service\"\\,status=~\"5..\"}[1m])) / sum(rate(http_requests_total{service=\"user-service\"}[1m])),Error rate impacting reliability
user_service_cpu_usage,cause,negative,metrics,avg(rate(process_cpu_seconds_total{service=\"user-service\"}[5m])) * 100,CPU utilization
user_service_memory_usage,cause,negative,metrics,process_resident_memory_bytes{service=\"user-service\"} / 1024 / 1024,Memory usage in MB
user_service_db_query_latency,cause,negative,metrics,histogram_quantile(0.95\\, rate(db_query_duration_seconds_bucket{service=\"user-service\"}[5m])),Database query latency
# Import
POST /api/v1/kpi/defs/bulk-csv < kpis.csv
Step 2: Verify KPIs
GET /api/v1/kpi/defs?tags=user-service&limit=10
Step 3: Test Correlation
# Run correlation for last hour
POST /api/v1/unified/correlation
{
"startTime": "2026-01-23T10:00:00Z",
"endTime": "2026-01-23T11:00:00Z"
}
Step 4: Schedule Periodic RCA
Set up a cron job or monitoring system to:
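# Run RCA every 15 minutes for the last 15 minutes.
# A crontab entry must be a single line, and each % must be escaped as \%.
# "date -d" is GNU date syntax and is not available on BSD/macOS.
*/15 * * * * curl -X POST http://mirador-core:8010/api/v1/unified/rca -H "Content-Type: application/json" -d "{\"startTime\":\"$(date -u -d '15 minutes ago' +\%Y-\%m-\%dT\%H:\%M:\%SZ)\",\"endTime\":\"$(date -u +\%Y-\%m-\%dT\%H:\%M:\%SZ)\"}"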
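If the scheduler runs a script rather than an inline command, the window can be computed without depending on GNU date. The following is a minimal Python sketch of the same "last 15 minutes" calculation; the function name is hypothetical, and the output format follows the RFC3339 UTC timestamps used throughout this guide.

```python
from datetime import datetime, timedelta, timezone


def rca_window(now, minutes=15):
    """Return the RCA request body covering the `minutes` before `now`."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    end = now.astimezone(timezone.utc)  # normalize any aware datetime to UTC
    start = end - timedelta(minutes=minutes)
    return {"startTime": start.strftime(fmt), "endTime": end.strftime(fmt)}
```

A scheduled job would call rca_window(datetime.now(timezone.utc)) and POST the result to /api/v1/unified/rca.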
Workflow 3: Historical Incident Analysis
Scenario: Analyze pattern of failures over past week
Step 1: List Failures
POST /api/v1/unified/failures/list
{
"limit": 100,
"offset": 0,
"filters": {
"start_time": "2026-01-16T00:00:00Z",
"end_time": "2026-01-23T00:00:00Z",
"severity": "high"
}
}
Step 2: Analyze Each Failure
# For each failure_uuid from step 1
POST /api/v1/unified/failures/get
{
"id": "e458d90f-f525-58a9-9e92-9f91faa73cf2"
}
Step 3: Run RCA for Each Incident
# For each incident time window
POST /api/v1/unified/rca
{
"startTime": "<incident_start>",
"endTime": "<incident_end>"
}
Step 4: Aggregate Patterns
Analyze RCA results to find:
Common root causes (e.g., kafka broker issues appearing in 80% of incidents)
Frequently affected services (e.g., payment-service)
Temporal patterns (e.g., failures cluster around 10 AM)
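The aggregation itself happens on the client side. As a sketch, the function below counts which KPI sits at the deepest WHY step of the top-scoring chain in each RCA response. It assumes that each chain exposes a score and a list of steps under a field named steps; that field name is an assumption, so adjust it to the actual response schema.

```python
from collections import Counter


def root_cause_counts(rca_responses):
    """Count the KPI at the deepest WHY step of each top-scoring chain."""
    counts = Counter()
    for response in rca_responses:
        chains = response.get("data", {}).get("chains", [])
        if not chains:
            continue  # low-confidence RCA: nothing to count
        top = max(chains, key=lambda c: c.get("score", 0))
        steps = top.get("steps", [])  # "steps" is an assumed field name
        if steps:
            deepest = max(steps, key=lambda s: s["why"])
            counts[deepest["kpiName"]] += 1
    return counts
```

Dividing each count by the number of analyzed incidents gives the kind of share quoted above (for example, one cause appearing in 80% of incidents).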
API Reference
Base URL
http://localhost:8010
Authentication
Mirador Core is designed to run behind an external API gateway or service mesh that handles authentication. No built-in authentication is required for API calls.
Content Type
All requests must include:
Content-Type: application/json
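For clients that do not use curl, the same conventions (base URL, JSON body, Content-Type header) can be captured in a small helper. This sketch uses only the Python standard library and builds the request without sending it; the helper name is hypothetical, and any authentication headers required by your gateway would need to be added.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8010"


def build_request(path, body):
    """Build (but do not send) a JSON POST request for a Mirador Core path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it to a running instance.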
KPI Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/kpi/defs | List KPI definitions |
| POST | /api/v1/kpi/defs | Create/update KPI definition |
| GET | /api/v1/kpi/defs/{id} | Get single KPI definition |
| DELETE | | Delete KPI definition |
| POST | | Bulk import KPIs (JSON) |
| POST | /api/v1/kpi/defs/bulk-csv | Bulk import KPIs (CSV) |
| POST | | Semantic search for KPIs |
Failure Detection Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/unified/failures/detect | Detect component failures |
| POST | | Correlate transaction failures |
| POST | /api/v1/unified/failures/list | List stored failures |
| POST | /api/v1/unified/failures/get | Get failure details |
| POST | | Delete failure record |
Correlation Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/unified/correlation | Run correlation analysis |
RCA Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/unified/rca | Compute root cause analysis |
Internal Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | | Health check |
| GET | | Readiness check |
| GET | /metrics | Prometheus metrics |
| GET | /api/v1/health | API v1 health |
| GET | | OpenAPI spec (YAML) |
| GET | | OpenAPI spec (JSON) |
| GET | /swagger/index.html | Swagger UI |
Query Parameters
KPI List Filtering:
limit (int): Max results (default: 10, max: 10000)
offset (int): Pagination offset (default: 0)
tags (string[]): Filter by tags (comma-separated)
layer (string): Filter by layer (impact/cause)
sentiment (string): Filter by sentiment (positive/negative/neutral)
signalType (string): Filter by signal type (metrics/logs/traces/business/synthetic)
kind (string): Filter by kind (business/tech)
classifier (string): Filter by classifier
datastore (string): Filter by datastore
Example:
GET /api/v1/kpi/defs?layer=impact&sentiment=negative&limit=50&tags=critical,payments
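Building the query string programmatically avoids encoding mistakes. The sketch below assembles the path for the KPI list endpoint with the standard library. The function name is hypothetical, and rejecting a limit below 1 is an assumption; only the maximum of 10000 is documented.

```python
from urllib.parse import urlencode

MAX_LIMIT = 10000  # documented maximum for `limit`


def kpi_list_path(limit=10, offset=0, tags=None, **filters):
    """Build the path and query string for GET /api/v1/kpi/defs."""
    if not 0 < limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    params = {k: v for k, v in filters.items() if v is not None}
    params["limit"] = limit
    if offset:
        params["offset"] = offset
    if tags:
        params["tags"] = ",".join(tags)  # tags are comma-separated
    # safe="," keeps the tag separator as a literal comma instead of %2C
    return "/api/v1/kpi/defs?" + urlencode(params, safe=",")
```

Calling kpi_list_path(layer="impact", sentiment="negative", limit=50, tags=["critical", "payments"]) reproduces the example request above.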
Troubleshooting
Problem: KPIs Not Being Detected by Correlation
Symptoms:
Correlation returns empty results
RCA shows “No correlation data”
Diagnosis:
# 1. Check if KPIs exist
GET /api/v1/kpi/defs?limit=10
# 2. Check if KPIs have formulas
GET /api/v1/kpi/defs/{id}
# Verify "formula" or "query" field is populated
# 3. Test formula manually against VictoriaMetrics
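# -G with --data-urlencode keeps braces, quotes, and brackets in the formula intact
curl -G "http://victoriametrics:8428/api/v1/query" --data-urlencode "query=<your_formula>" --data-urlencode "time=$(date +%s)"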
Solutions:
Ensure KPIs have valid formula or query fields
Verify VictoriaMetrics contains data for the formula
Check time window overlaps with actual data availability
Problem: Failure Detection Returns Empty Results
Symptoms:
/unified/failures/detect returns "total_incidents": 0
Diagnosis:
# 1. Check if traces exist
curl "http://victoriatraces:10428/api/v1/search?start=<start_epoch>&end=<end_epoch>&tags={}"
# 2. Check if metrics exist
curl "http://victoriametrics:8428/api/v1/query_range?query=up&start=<start>&end=<end>&step=60"
# 3. Verify time window format
# Must be RFC3339 UTC: "2026-01-23T10:00:00Z"
Solutions:
Ensure traces are being ingested to VictoriaTraces
Verify error spans have the error=true tag
Check anomaly metrics have the iforest_is_anomaly=true label
Confirm time window contains actual incident data
Problem: RCA Returns Low Confidence
Symptoms:
RCA score is 0 or very low
Chains are empty
Notes contain: “Correlation produced no candidates”
Diagnosis:
# 1. Run correlation first to see what it finds
POST /api/v1/unified/correlation
{
"startTime": "...",
"endTime": "..."
}
# 2. Check correlation response
# If empty, diagnose correlation (see above)
# 3. Verify KPIs have layer=impact and layer=cause defined
GET /api/v1/kpi/defs?layer=impact
GET /api/v1/kpi/defs?layer=cause
Solutions:
Define at least one layer=impact KPI
Define multiple layer=cause KPIs
Ensure time window contains degradation events
Run failure detection first to confirm incidents exist
Problem: Time Window Validation Errors
Symptoms:
400 Bad Request: time window too small
413 Payload Too Large: time window too large
Diagnosis:
# Check engine configuration
curl http://localhost:8010/api/v1/unified/metadata | jq '.engineConfig'
Solutions:
Adjust window to respect min_window and max_window
Default constraints:
min_window: 1m
max_window: 1h
Update configs/config.yaml if constraints are too restrictive
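A client can check the window before sending a request and avoid the round trip. The sketch below uses the documented defaults (min_window 1m, max_window 1h); a deployment may override them in configs/config.yaml. Treating both bounds as inclusive is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Documented defaults; a deployment may override them in configs/config.yaml.
MIN_WINDOW = timedelta(minutes=1)
MAX_WINDOW = timedelta(hours=1)


def parse_rfc3339_utc(value):
    """Parse timestamps of the form 2026-01-23T10:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def check_window(start, end, min_window=MIN_WINDOW, max_window=MAX_WINDOW):
    """Return 'ok', 'too small', or 'too large' for the given window."""
    duration = parse_rfc3339_utc(end) - parse_rfc3339_utc(start)
    if duration < min_window:
        return "too small"
    if duration > max_window:
        return "too large"
    return "ok"
```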
Problem: Weaviate Connection Failures
Symptoms:
KPI creation fails with “weaviate unavailable”
Failure detection works but failures aren’t persisted
Diagnosis:
# 1. Check Weaviate health
curl http://weaviate:8080/v1/.well-known/ready
# 2. Check Mirador Core logs
docker logs mirador-core | grep -i weaviate
# 3. Verify Weaviate is enabled in config
cat configs/config.yaml | grep -A5 weaviate
Solutions:
Ensure Weaviate container is running: docker ps | grep weaviate
Verify network connectivity: docker network inspect mirador-net
Set weaviate.enabled: true in config
Restart Mirador Core: docker restart mirador-core
Problem: High Memory Usage
Symptoms:
Mirador Core container OOM killed
Slow response times
Diagnosis:
# Check container memory
docker stats mirador-core
# Check VictoriaMetrics data volume
docker exec victoriametrics du -sh /victoria-metrics-data
Solutions:
Increase container memory limit in docker-compose.yml:
deploy:
  resources:
    limits:
      memory: 4G
Reduce default_query_limit in config.yaml
Enable more aggressive caching (increase TTL)
Reduce max_window to limit analysis scope
Problem: Correlation Takes Too Long
Symptoms:
Correlation requests time out
High CPU usage during correlation
Diagnosis:
# Check number of KPIs
curl -s http://localhost:8010/api/v1/kpi/defs | jq '.total'
# Check time window size
# Large windows = more data to analyze
Solutions:
Reduce time window (use 15m instead of 1h)
Reduce number of active KPIs (archive unused ones)
Increase timeout in config:
database:
  victoria_metrics:
    timeout: 60000 # 60 seconds
Scale Mirador Core horizontally (add more replicas)
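When a long period still has to be covered, one option is to split it into 15-minute request bodies and analyze them one at a time. The sketch below shows only the splitting and the function name is hypothetical. Relationships that span a chunk boundary may be missed, so treat the per-chunk results as a starting point rather than a replacement for a single wide window.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def split_window(start, end, chunk=timedelta(minutes=15)):
    """Split [start, end) into consecutive bodies of at most `chunk` each."""
    cursor = datetime.strptime(start, FMT).replace(tzinfo=timezone.utc)
    stop = datetime.strptime(end, FMT).replace(tzinfo=timezone.utc)
    bodies = []
    while cursor < stop:
        nxt = min(cursor + chunk, stop)  # the last chunk may be shorter
        bodies.append({
            "startTime": cursor.strftime(FMT),
            "endTime": nxt.strftime(FMT),
        })
        cursor = nxt
    return bodies
```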
Getting Help
Logs:
# Mirador Core logs
docker logs mirador-core --tail 100 -f
# All services logs
docker-compose logs -f
Health Checks:
# Overall health
curl http://localhost:8010/api/v1/health | jq
# Service status
curl http://localhost:8010/microservices/status | jq
Metrics:
# Prometheus metrics
curl http://localhost:8010/metrics | grep mirador
Support:
GitHub Issues: https://github.com/mirastacklabs-ai/mirador-core/issues
Documentation: http://localhost:8010/swagger/index.html
Summary
This guide covered the four core components of Mirador Core:
KPIs: Define and manage metrics (foundation)
Failures: Detect and track incidents
Correlation: Analyze statistical relationships
RCA: Explain root causes with 5-WHY methodology
Key Takeaways:
Always define KPIs before running failure detection or RCA
Use impact (layer=impact) and cause (layer=cause) KPIs for best results
Correlation provides statistical evidence; RCA adds causal reasoning
Time windows must respect configured min/max constraints
All components are interconnected and build upon each other
For complete API documentation, see the Swagger UI at http://localhost:8010/swagger/index.html.