Lead Infrastructure Engineer for enterprise Gen AI applications hosted on OpenShift platforms. Manage end-to-end infrastructure, capacity planning, and disaster recovery.
Requirements
- Lead and manage end-to-end infrastructure for enterprise Gen AI applications hosted on OpenShift platforms
- Own capacity planning and sizing for OpenShift clusters, including OCP pods, Oracle databases, Redis caches, Dell ECS storage, Elastic DB, Postgres, Red Hat, Ubuntu (optional), and related infrastructure components
- Design and operationalize Disaster Recovery (DR) infrastructure for Gen AI platforms, ensuring high availability and resilience
- Manage certificate lifecycle (TLS/SSL), key management, and secrets handling across Gen AI applications and platforms
- Implement and oversee vulnerability management, patching, and remediation across containers, Kubernetes, and underlying infrastructure
- Support and coordinate penetration testing activities, addressing infrastructure-related findings and security gaps
- Good understanding of AWS services (EC2, VPC, CloudWatch, Lambda, Bedrock) and tools (Terraform, CloudFormation) alongside on-premises OpenShift environments
- Operate and support Control-M schedulers, along with logging, monitoring, and alerting tools for platform observability