GEICO is seeking an experienced Engineer with a passion for building high-performance, low-maintenance, zero-downtime platforms and applications. The Senior Staff Engineer in Availability and Incident Management will design and deploy machine learning systems that enable intelligent incident detection, automated root cause analysis, and predictive reliability improvements across the platform.
Requirements
- Experience building and deploying ML systems in production with cross-functional engineering teams
- Fluency in at least two modern languages such as Python, Go, Java, C++, or C#
- Experience architecting multi-component ML platforms using open-source/cloud-agnostic components
- Experience with the end-to-end ML lifecycle
- Experience with cloud providers (Azure, AWS, or GCP) in production ML environments
- Experience with observability tools and distributed systems monitoring, logging, tracing, and root cause analysis
- Experience building multi-agent systems using LLMs and agentic frameworks
- Hands-on experience with RAG, semantic search, and vector databases
- Experience designing human-in-the-loop workflows and safety controls for autonomous systems
- Strong architecture and design skills
- Proven ability to solve complex problems with data-driven approaches
- Experience fine-tuning or deploying open-source LLMs is a plus
- Experience with data warehouse/lakehouse platforms is a plus
Benefits
- Comprehensive Total Rewards program
- 401(k) savings plan with 6% match
- Performance and recognition-based incentives
- Tuition assistance
- Mental healthcare
- Fertility and adoption assistance
- GEICO Flex program (work from anywhere in the US for up to four weeks per year)