SRE Archives - AiOps Community

Pod-Level Resource Managers and AIOps Signal Integrity

Observability Author - May 3, 2026

Kubernetes 1.36’s pod-level resource managers reshape more than scheduling; they also redefine observability signals. Here’s how memory QoS and pod-scoped controls affect AIOps baselines, forecasting, and automation.

When Infrastructure Lies: Drift, Staleness, and AIOps Truth

DevSecOps in AIOps Author - May 3, 2026

Terraform shows green. Controllers report success. Production still fails. This analysis reframes AIOps as a truth-detection layer above declarative systems.

Comprehensive Guide to AI Observability Tools

DevSecOps in AIOps Author - May 3, 2026

Explore a comprehensive guide to AI observability tools, comparing architecture, features, and performance to help teams make informed decisions.

The Agent Trust Blueprint for AI in Production Pipelines

DevSecOps in AIOps Author - May 3, 2026

A rigorous blueprint for calibrating trust in AI agents across CI/CD and production workflows. Learn how to combine confidence scoring, guardrails, human review, and progressive autonomy.

The Velocity Trap: When DevOps Speed Breaks Reliability

DevOps in AIOps Tutorials Author - May 3, 2026

AI is accelerating DevOps delivery, but at what cost? Explore how velocity, error budgets, and AIOps must align to prevent systemic fragility and SLO debt.

Calibrated Trust: Governing AI Agents in Production Ops

DevSecOps in AIOps Author - May 3, 2026

AI agents are entering production pipelines, but autonomy without governance creates systemic risk. Explore a calibrated trust model and architectural patterns for safe AIOps adoption.

Kubernetes 1.36 Observability Changes SREs Must Address

Observability Author - April 30, 2026

Kubernetes 1.36 tightens staleness handling and kubelet authorization. Here’s what those changes mean for AIOps signal quality and production observability.

Building a Runbook-Aware AI Investigator on Kubernetes

DevOps in AIOps Tutorials Author - April 30, 2026

Learn how to build a runbook-aware AI incident investigator on Kubernetes using events, OpenTelemetry, and structured guardrails for safe, transparent diagnostics.

Operationalizing AI Agents in IT Ops with Guardrails and SLOs

Advanced Concepts dev ops - April 30, 2026

A practical framework for running AI agents in production IT Ops. Learn how to define agent SLOs, implement guardrails, model failure modes, and design safe rollback strategies.

Continuous Profiling in AIOps: From Pyroscope to Production

Observability dev ops - April 30, 2026

A practitioner’s blueprint for operationalizing continuous profiling in AIOps. Learn how to connect profiles with metrics, traces, and ML for automated performance optimization.

Continuous Profiling in AIOps: From Pyroscope to Production

DevOps in AIOps Tutorials dev ops - April 30, 2026

Learn how to integrate continuous profiling into your AIOps pipeline. Correlate profiles with incidents, reduce noisy workloads, and accelerate root cause analysis in production.

Auto-Diagnosing Kubernetes with an AI Investigation Pipeline

DevOps in AIOps Tutorials Sanju - April 30, 2026

Build an end-to-end AI-powered Kubernetes investigation workflow using OpenTelemetry, structured runbooks, and LLM reasoning, complete with prompts and evaluation guidance.
