The SRE Lead is responsible for driving reliability, resiliency, and performance across Birlasoft's Platform Engineering ecosystem, including microservices, cloud workloads, Cogito agentic operations, and enterprise applications.
Requirements
- Define and maintain SLOs, SLIs, and error budgets for platform services.
- Lead capacity planning, performance tuning, autoscaling strategies, and resilience testing.
- Drive reliability patterns such as graceful degradation, retry logic, and distributed failover.
- Own the monitoring stack across Azure Monitor, App Insights, Log Analytics, OpenTelemetry, and AKS.
- Design alerting standards covering noise reduction, correlation, routing, and escalation.
- Build health, reliability, and risk dashboards for leadership.
- Lead incident response, on-call processes, and blameless postmortems.
- Drive MTTR reduction through automation, playbooks, and predictive analytics.
- Establish proactive issue detection mechanisms using patterns, telemetry, and AIOps.
- Implement automation-first operations for remediation, self-healing, and repetitive tasks.
- Integrate AI-driven agent workflows with Cogito for troubleshooting, optimization, and cost-ops.
- Increase operational maturity through runbooks, autopilot actions, and integrated CI/CD reliability checks.
- Partner with Platform Engineering pods (Infra, Core, Integration, DevEx, Security) to embed reliability by design.
- Influence architecture for scalability, observability, and fault tolerance.
- Mentor SRE engineers and lead the maturity of SRE practices across accounts.