About the Role
This role focuses on advancing AI-driven incident investigation for Kubernetes environments and observability tooling. Elastic has introduced an agentic workflow that automatically launches diagnostic processes the moment an alert fires, aiming to close the gap between receiving an alert and understanding its root cause. For organizations operating Kubernetes at scale, this position involves building systems that assemble evidence and surface recommended remediation steps before a site reliability engineer (SRE) even opens the ticket.
Key Responsibilities
- Design and implement agentic investigation workflows that autonomously diagnose incidents in Kubernetes clusters.
- Develop observability skills based on the Model Context Protocol (MCP) that integrate with existing monitoring and alerting infrastructure.
- Engineer solutions that identify root causes, compile forensic evidence, and suggest actionable next steps automatically.
- Collaborate with SRE and platform engineering teams to reduce mean time to resolution (MTTR) for cloud-native incidents.
- Optimize incident response pipelines to reduce on-call burnout and keep outages from compounding.
- Contribute to the evolution of AI-powered observability tooling that closes the gap between alert and answer.
- Maintain and enhance integrations across the Elastic Stack and Kubernetes ecosystem.
Requirements
- Deep expertise in Kubernetes architecture, operations, and troubleshooting at enterprise scale.
- Proficiency with observability platforms and AI/ML-driven diagnostic tooling.
- Strong software engineering background with experience building automated workflows and agentic systems.
- Familiarity with Elasticsearch, the Elastic Stack, and modern incident management practices.
- Understanding of site reliability engineering principles and the challenges faced by on-call teams.
- Ability to work cross-functionally with security, infrastructure, and platform engineering stakeholders.
Compensation & Benefits
- Competitive compensation package aligned with industry standards for technical roles.
- Comprehensive health and wellness benefits designed to support employees and their families.
- Flexible work arrangements, including distributed and remote-friendly options.
- Professional development opportunities and access to cutting-edge AI and cloud-native projects.
- Equity participation and retirement savings plans, depending on location.
How to Apply
Interested candidates can apply directly via the Apply Now button on this page, which redirects to the full job listing hosted by Elastic. Review the complete requirements and submit an application through the original posting for consideration. For those exploring adjacent opportunities in cybersecurity and AI, listings such as "Anthropic Seeks Cybersecurity Researchers to Advance N-Day Exploit System" and "Rockwell Automation Seeks OT Cybersecurity Experts for SecureOT AI Expansion" also highlight the industry’s growing investment in intelligent security and observability tooling.