We are seeking an experienced Senior Linux System Administrator/System Support Engineer with expertise in supporting High Performance Computing (HPC) environments to join our HPC support team.
Requirements
- Deploy, configure, maintain, and troubleshoot Linux servers and HPC cluster systems (Red Hat, CentOS, Ubuntu, or others) in primarily physical environments, as well as virtual and cloud environments.
- Support, maintain, and optimize HPC systems, including installation and servicing of cluster managers, operating systems, and network fabrics, and advanced troubleshooting of hardware, software, and parallel file systems (e.g., Lustre, GPFS).
- Monitor system performance, availability, and security using industry-standard tools and practices; ensure compliance with organizational policies and external regulations.
- Plan and execute upgrades, patches, enhancements, and migrations to ensure systems are current, secure, and optimized.
- Automate system administration tasks using scripting languages (Bash, Python, Perl, etc.) and configuration management and infrastructure-as-code tools (Ansible, Puppet, Chef, Terraform).
- Implement and maintain backup/recovery strategies, disaster recovery plans, and system documentation.
- Collaborate with development, network, and security teams to support application deployments and troubleshoot issues, particularly in multi-technology HPC environments.
- Provide technical consulting, mentoring, and guidance to junior team members and contribute to internal knowledge sharing.
- Ensure compliance with strict security protocols in sensitive environments (e.g., government, research); Top Secret Positive Vetting (TSPV) clearance will be required.
- Participate in the on-call rotation and respond to system incidents and outages.
- Assist with technical proposals, solution design, and enterprise-level architecture for new projects and customer engagements.
Benefits
- Health & Wellbeing
- Personal & Professional Development
- Unconditional Inclusion