Autonomous SRE Agent

Your AI-powered
SRE teammate

An intelligent agent that knows our architecture, reads our logs, queries our metrics, searches our incidents, and investigates production issues like a senior engineer who’s been here since day one.

18 Capabilities
Auto End-to-End Investigations
<4m Time to Triage
Live Code ↔ Logs Correlation
24/7 Senior-Level Context
Zero Context Switching
At a Glance

What can it do?

Give it an alert, an error, or a question about the system. Here’s what it brings to the table.

🔍

Query Incident History

Searches past incident tickets by alert name, service, or error pattern. Instantly surfaces past root causes, resolutions, and open action items before you even look at logs. Pattern matching against your team’s institutional memory.
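A minimal sketch of what this lookup could look like. The incident records, field names, and matching rule here are illustrative assumptions; a real implementation would query the team's ticketing system.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    ticket: str
    alert: str
    service: str
    root_cause: str
    open_actions: list = field(default_factory=list)

# Hypothetical institutional memory; real data comes from the ticketing system.
HISTORY = [
    Incident("INC-101", "SyntheticTestNavigationFailures", "primary-lb",
             "Tester config stale after rollout", ["update tester config"]),
    Incident("INC-077", "HighErrorRate", "core-engine-service",
             "Bad deploy, rolled back"),
]

def search_incidents(query: str) -> list:
    """Match past incidents by alert name, service, or root-cause text."""
    q = query.lower()
    return [i for i in HISTORY
            if q in i.alert.lower()
            or q in i.service.lower()
            or q in i.root_cause.lower()]
```

Even a substring match like this surfaces the prior root cause and open action items before any log query runs.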

📖

Find Runbooks & Documentation

Given an alert name, finds the matching runbook with battle-tested investigation steps. Dozens of runbook pages and knowledge base articles at its fingertips.

📊

Query Metrics

Queries 100+ custom metrics: CPU, memory, event loop delay, error rates, and fleet-level blast-radius metrics (pod count, restart sweep, session load, memory outlier detection).

📜

Logs & Log Analytics

Real-time queries with trace correlation. SQL-like aggregation for top-errors, deployment before/after comparison, blast radius scoping (tenants, sessions, pods), and infrastructure audit logs.

💻

Clone Any Service & Trace Stack Traces

Sees a stack trace in the logs? It clones the service repo, locates the exact source file and line, reads the code, and explains the logic flaw. Works across all services. Context-aware code correlation powered by deep architectural knowledge.
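The first step of that correlation is mechanical: pull file and line out of the trace. A sketch, assuming Node.js-style stack frames (an assumption; other runtimes need their own patterns):

```python
import re

# Node.js-style frame, e.g. "at handleRequest (/app/src/router.js:42:13)".
# The frame format is an assumption about the service's runtime.
FRAME = re.compile(
    r"at (?P<fn>[\w.<>]+) \((?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+)\)"
)

def locate_frames(stack: str):
    """Extract (function, file, line) tuples so the agent can open the source."""
    return [(m["fn"], m["path"], int(m["line"])) for m in FRAME.finditer(stack)]

trace = """TypeError: Cannot read properties of undefined
    at handleRequest (/app/src/router.js:42:13)
    at dispatch (/app/src/server.js:7:3)"""
```

From there the agent checks out the repo at the deployed version and reads the named file around the named line.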

🏗️

Autonomous Architecture Learning & Service Deep Dives

Built through autonomous learning — the agent analyzes each service, mapping request flows, dependencies, and service criticality automatically. It understands the full cloud-native architecture, including the request lifecycle, layered security model, failure cascades, and dependency graph. Includes detailed insights into multiple services, covering critical paths, metrics, common failure modes with symptom-to-cause mapping, and key configuration details.

📦

Release Notes & Diffs

Compares any two releases. Shows merged PRs, linked tickets, authors. Finds which release contains a specific change.

🚨

Mandatory Blast Radius Assessment

Before root cause analysis, the agent quantifies the impact: affected tenants, sessions, pods, regions, and duration. Multi-dimension scoping with SLA classification (Negligible to Critical) and escalation guidance. Data-driven, never assumed.
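A minimal sketch of the Negligible-to-Critical classification. The thresholds below are hypothetical placeholders; real cutoffs would come from the team's SLA policy.

```python
def classify_blast_radius(tenants: int, sessions: int, synthetic_only: bool) -> str:
    """Map measured impact to an escalation level.

    Thresholds are illustrative assumptions, not the team's real SLA numbers."""
    if synthetic_only or tenants == 0:
        return "Negligible"   # test traffic only, no real customers affected
    if tenants < 5 and sessions < 100:
        return "Low"
    if tenants < 50:
        return "Medium"
    return "Critical"
```

The key property is that the inputs are measured counts from queries, never eyeballed estimates.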

🌳

Incident Scoping Decision Tree

Service A is down? Complete outage. Service B is down? Broken flow only. Structured blast radius determination for any failure.
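In code, the decision tree reduces to a lookup from failed service to customer impact. The services and impact strings below are hypothetical examples; the real table is derived from the dependency graph.

```python
# Hypothetical service-to-impact mapping illustrating the decision tree.
IMPACT = {
    "api-gateway": "Complete outage: all inbound traffic fails",
    "auth-service": "Broken flow: new logins fail, active sessions survive",
    "edge-proxy-service": "Degraded: edge traffic rerouted, higher latency",
}

def scope_failure(service: str) -> str:
    """Instant service-failure-to-impact mapping; unknown services escalate."""
    return IMPACT.get(service, "Unknown service: escalate for manual scoping")
```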

How It Works

The Deterministic 8-Step Investigation Protocol

To guarantee the agent doesn't hallucinate findings or skip critical diagnostic phases, its behavior is governed by a strictly sequenced protocol. The protocol doesn't just run queries; it systematically encodes the exact method a senior engineer uses to diagnose complex systems, culminating in a structured investigation report.

1

Anchor the Timestamp

MANDATORY first step. Calculate absolute start/end times (±1h) to prevent temporal drift. Extract the alert’s exact UTC time, compute query boundaries, and state the window explicitly. All subsequent queries use these absolute timestamps — never relative time windows that silently drift.
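The anchoring step is a few lines of date arithmetic. A sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta

def anchor_window(alert_utc: str, pad_hours: int = 1):
    """Compute absolute query boundaries (+/- pad_hours) around the alert time.

    Every subsequent query reuses these fixed ISO timestamps, never a
    relative window like 'last hour' that silently drifts as the
    investigation runs."""
    t = datetime.fromisoformat(alert_utc.replace("Z", "+00:00"))
    pad = timedelta(hours=pad_hours)
    return (t - pad).isoformat(), (t + pad).isoformat()
```

For the alert in the demo below this yields a fixed 13:53 to 15:53 UTC window.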

2

Historical Synthesis

Query incident tracking and knowledge base systems for similar past incidents or runbooks to establish a known baseline. If a matching pattern is found, it is referenced immediately, skipping redundant investigation.

3

Triage & Blast Radius

Parse the alert and scope impact across tenants, sessions, services, and regions. Quantify the blast radius with data-driven metrics: who is affected, how many, and how severely. Determine the escalation level, from "negligible" to "critical."

4

System Pulse Check

Scan core metrics (CPU, memory, errors, latency) for correlated anomalies. Identify whether the signal is isolated or part of a broader systemic pattern across the infrastructure.
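One way to sketch the anomaly scan: compare each metric's latest reading against its baseline in standard deviations. The data shape and threshold are assumptions for illustration.

```python
import statistics

def anomalies(samples: dict, threshold: float = 3.0) -> list:
    """Flag metrics whose latest reading deviates > threshold sigmas
    from the preceding baseline samples.

    `samples` maps metric name to a time-ordered list of readings;
    the 3-sigma threshold is a hypothetical default."""
    flagged = []
    for name, series in samples.items():
        baseline, latest = series[:-1], series[-1]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # guard flat series
        if abs(latest - mean) / stdev > threshold:
            flagged.append(name)
    return flagged
```

If several core metrics flag at once, the signal is systemic rather than isolated.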

5

Log & Error Analysis

Aggregate and classify errors to map the failure surface across services and pods. Top-errors ranking, pod distribution, multi-service sweep — building the full error landscape.
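The top-errors ranking is essentially a GROUP BY over parsed log records. A minimal sketch (the record shape is an assumption):

```python
from collections import Counter

def top_errors(log_lines: list, n: int = 3):
    """SQL-like 'GROUP BY error ORDER BY count DESC LIMIT n' over parsed logs.

    Each log line is assumed to be a dict with an optional 'error' field."""
    counts = Counter(line["error"] for line in log_lines if line.get("error"))
    return counts.most_common(n)
```

The same grouping, keyed by pod or service instead of error text, produces the pod-distribution and multi-service views.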

6

Trace Deep-Dive

Follow the failing request via distributed traces to isolate the exact point of failure. Cross-service span hierarchies, slow span identification, and request flow visualization.

7

Code & Deployment Correlation

Link the issue to source code and recent deployments to identify the root cause. Locate the source file from stack traces, compare error rates before vs. after deploys, and surface the exact change that broke it.
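The before/after comparison can be sketched as a rate check across the deploy boundary. The 2x factor below is a hypothetical threshold, not a team standard:

```python
def deploy_regression(before_errors: int, before_reqs: int,
                      after_errors: int, after_reqs: int,
                      factor: float = 2.0) -> bool:
    """Flag a deploy when the post-deploy error rate exceeds the
    pre-deploy rate by `factor`. Threshold is illustrative."""
    before_rate = before_errors / max(before_reqs, 1)
    after_rate = after_errors / max(after_reqs, 1)
    return after_rate > before_rate * factor
```

Rates rather than raw counts matter here: a deploy that also shifts traffic volume would otherwise produce false positives.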

8

Adversarial Verification

The mandatory adversarial gate. Re-validate all assumptions — time window, region, environment, queries. Does the root cause timestamp align with the alert? Does the evidence actually support the conclusion? Any mismatch invalidates the result. Born from a real misdiagnosis where a wrong time window produced a plausible but incorrect answer. A single CRITICAL failure invalidates the entire investigation.
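The gate behaves like a checklist where a single failed check vetoes the report. A sketch; the field names are illustrative assumptions:

```python
def verify(investigation: dict) -> bool:
    """Cross-check the conclusion against the original alert parameters.

    Any failed check is treated as CRITICAL and invalidates the whole
    investigation. Field names here are hypothetical."""
    alert, finding = investigation["alert"], investigation["finding"]
    checks = {
        "region matches": alert["region"] == finding["region"],
        "environment matches": alert["env"] == finding["env"],
        "root cause inside query window": (
            finding["window_start"]
            <= finding["root_cause_ts"]
            <= finding["window_end"]
        ),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed
```

Note the window check: it is exactly the guard that would have caught the original misdiagnosis, where evidence from the wrong time window supported a plausible but wrong conclusion.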

Real Investigation

See it in action

A real alert came in at 4:53 PM on a Thursday. Here’s exactly what happened.

SRE Agent — Incident Investigation
Thursday, 4:53 PM
You:
help me help me !!!
i got an alert
❌ SyntheticTestNavigationFailures
test_flow: navigate-primary-lb
edge_pop: POP-EU-1
edge_region: region-eu-west
edge_ip: 192.0.2.170
environment_type: prod
severity: critical
tenant_id: TEST_TENANT_1
SRE Agent:
On it. Let me parse this alert before anyone panics more than you already are.
1
Timestamp anchored. Alert at Thu 4:53 PM → 2026-03-05T14:53:00Z UTC. Query window: 13:53–15:53 UTC.

This is a synthetic monitoring alert, not a core service alert. Tenant TEST_TENANT_1 is the synthetic test tenant, not a real customer. POP: POP-EU-1, edge IP: 192.0.2.170.
2
Querying synthetic tester logs. Filtering for POP-EU-1 failures in the last hour.

reason: VALIDATION_FAILED enriched: EDGE_CONNECTIVITY_FAILED

Every failure reports envStatus: inactive and version v1.79.0. The primary-lb namespace looks like it’s been decommissioned.
SRE Agent:
3
Multi-POP aggregation. Expanding the query across all POPs to see if this is POP-EU-1-specific.

It is not. 174 failures across 6 POPs:
POP      | Failures
POP-EU-1 | 48
POP-EU-2 | 28
POP-US-1 | 26
POP-AS-1 | 25
POP-US-2 | 24
POP-ME-1 | 23
100% EDGE_CONNECTIVITY_FAILED — every single failure is the same pattern. This is systemic, not localized.
4:54 PM — Checking infrastructure logs...
SRE Agent:
4
Egress Router logs (region-eu-west). Found heavy upstream connectivity errors:

dial tcp 35.246.81.205:8080: i/o timeout — repeated across multiple primary-lb tenants.

Egress router can’t reach primary-lb endpoints — consistent with the v1.79.0 namespace being inactive/scaled down.
5
Multi-service sweep (CORE_SYS, region-eu-west). Swept all core services for errors:
Service             | Errors | Note
core-engine-service | 32     | Single tenant, single pod
api-gateway         | 19     | Context canceled (normal)
edge-proxy-service  | 0      |
auth-service        | 0      |
The 32 engine errors are all from one tenant (CUSTOMER_TENANT_A) on one pod — a configuration issue, completely unrelated to this alert. No systemic CORE_SYS errors.
SRE Agent:
Verdict: Medium severity. No customers affected.

The primary-lb namespace v1.79.0 has been decommissioned as part of the v1.80.0 rollout. The egress router is still routing primary-lb traffic to endpoints that no longer accept connections, causing dial tcp i/o timeout across all POPs.

The synthetic tester config hasn’t been updated to reflect this transition — it keeps testing primary-lb and failing 174 times. All failures are from test tenants. Zero customer impact.
Next steps:
1. Confirm v1.79.0 decommission is intentional
2. Update synthetic tester config to stop testing primary-lb
3. Investigate CUSTOMER_TENANT_A profile issue separately
You:
ok so give me the incident brief, what do i put in the ticket?
SRE Agent:
Here’s your structured incident brief — copy-paste ready:
Dimension  | Detail
What       | Monitoring-plane alert SyntheticTestNavigationFailures. primary-lb detected as inactive (EDGE_CONNECTIVITY_FAILED) across all POPs.
Where      | POP-EU-1 (alert target), but failures span 6 POPs. CORE_SYS region: region-eu-west. Not POP-specific; a systemic primary-lb issue.
When       | 2026-03-05T14:53:00Z (UTC). First failures at ~15:08 UTC. Detected by monitoring alert.
Who        | Synthetic test tenant TEST_TENANT_1 (not a real customer). Zero customer impact.
Why        | primary-lb namespace v1.79.0 decommissioned during v1.80.0 rollout. Egress router can no longer reach the decommissioned endpoints.
Mitigation | Update synthetic tester config to stop testing primary-lb, or wait for self-suppression. No code change needed.
Owner      | Infra/deployment team; synthetic tester config not updated post-rollout.
Severity   | SEV-3 (Medium)
That’s your full scoping. The loudest alert in the queue turned out to be our own synthetic tester yelling at a namespace that’s already been retired. No customers harmed in the making of this incident.
Layered Architecture

All capabilities

Built with a layered approach: generic reusable skills that any agent can leverage, plus SRE-dedicated extensions that apply an operational perspective on top.

Executable Skill Layer — Generic Skills
📃

Log Queries

Real-time log queries across all environments with trace correlation, transaction-centric analysis, and scoping filters. Not SRE-specific — can be used by developers, QA, or other agents on demand.

📊

Deep Log Analytics

SQL-like aggregation over logs. Top-errors ranking, pod distribution, deployment before/after comparison for regression detection. Reusable for any analytical workflow.

📈

Metrics Queries

Query 100+ custom metrics. CPU, memory, event loop delay, error rates, fleet-level metrics (pod count, restart sweep, session load, memory outlier detection). Links to dashboards included.

🔗

Distributed Tracing

Trace API spans, cross-service latency, and request flow visualization. Follow a single request across the entire service mesh to pinpoint where it breaks.

💻

Clone & Read Source Code

Given a stack trace, clones the service repo, locates the exact source file and line, reads the code, and explains the logic. Works across all services. Reusable by any developer or agent.

📦

Release Notes & Diffs

Compares any two releases. Shows merged PRs, linked tickets, authors. Finds which release contains a specific change. Version-by-timestamp lookup and commit diffs.

🔍

Incident History Search

Searches past incident tickets by alert name, service, or error pattern. Surfaces past root causes, resolutions, and open action items. Pattern matching against institutional memory.

📖

Knowledge Base & Documentation Search

Full-text search across internal documentation, runbooks, and knowledge base articles. Alert-to-runbook mapping for instant context retrieval.

🛠️

Infrastructure Log Queries

Direct log queries against team-owned infrastructure components — synthetic testers, routers, proxies, and load balancers across multiple cloud projects and regions.

🔗

End-to-End Traffic Chain Visibility

Full observability across the traffic chain — from edge to backend. Cross-project log correlation: query routing layers for upstream errors, synthetic testers for external results, and core services for health — all in a single investigation flow.

SRE Overlay Layer — Operational Extensions
🚨

Blast Radius Assessment

Mandatory impact scoping before root cause analysis. Quantifies affected tenants, sessions, pods, regions, services, and duration. Data-driven escalation from “negligible” to “critical” — never assumed, always measured.

🛡️

Adversarial Investigation Verifier

Mandatory adversarial gate. Cross-checks every query against the original alert parameters: time window, cloud project, region, namespace, pod, and conclusion-evidence consistency. Born from a real misdiagnosis. A single CRITICAL failure invalidates the entire investigation.

🌳

Incident Scoping Decision Tree

Structured blast radius determination for any failure. Service A is down? Complete outage. Service B is down? Broken flow only. Instant service-failure-to-impact mapping using proven troubleshooting patterns.

Domain Knowledge Layer — Architecture & Context
🌍

Environment Reference

Every cloud project, cluster, region, namespace, endpoint, and health URL across all environments. Copy-paste query templates for any environment. The context needed to ask the right question.

🏗️

Architecture Overview

Request lifecycle, dependency graph, security model, communication topology, and failure cascade patterns. Built through autonomous learning — the agent mapped the system itself.

🧬

Service Deep Dives

Critical paths, configuration, metrics, failure modes, and symptom-to-cause mapping for every core service. Dense domain knowledge that enables accurate interpretation of raw telemetry.

📋

Service Catalog

Complete service registry: repo, language, role, dependencies, and alert patterns. Schemas, thresholds, and cross-service relationships — all reusable across other agents.

📝

Log Format Schema

Every log field documented. Log envelope structure, payload fields, tracing context, protocol-specific fields, and error shapes.