Responsible for the performance, availability, and reliability of our cloud-based services and their underlying infrastructure, serving as a key technical subject matter expert.
Requirements
- Manage, troubleshoot, and optimize containerized applications and infrastructure deployed on Kubernetes, Red Hat OpenShift, and OpenStack platforms.
- Serve as the Subject Matter Expert (SME) for core cloud infrastructure technologies, including advanced Linux (CentOS) system administration, Docker and containers, and complex networking configurations.
- Lead the investigation and resolution of complex, high-severity customer issues, applying strong analytical skills to quickly diagnose problems across the entire cloud stack.
- Develop, test, and maintain robust automation scripts using Python and Ansible to streamline daily operational tasks and improve overall service efficiency.
- Provide end-to-end Escalation, Monitoring, and Emergency (EME) support, acting as the final escalation point to ensure service availability and meet SLAs.
- Liaise directly with customer and internal teams to understand requirements and deliver tailored technical solutions.
Benefits
- Scheduling flexibility to ensure 24/7 coverage and rapid response times.
- Rotational on-call schedule with direct customer contact and war room engagement.