The NMCI Service Management Integration and Transport group at Leidos has an opening for a Site Reliability Engineer to focus on the reliability, performance, and scalability of complex distributed systems. The SRE will develop and execute tests focused on system resilience, performance under load, and failure scenarios, and work in tandem with other Site Reliability Engineers and development teams to create automated testing frameworks.
Responsibilities
- Work alongside the development and operations teams to ensure speedy and reliable software deployments, monitor systems, and improve overall reliability of the platform.
- Develop features using the AI coding tool and the repository of scripts to automate, scale, test, and secure the cloud infrastructure and pipelines.
- Enhance performance monitoring of the various systems via Splunk or other dashboard reporting tools.
- Identify performance bottlenecks and optimize the performance of cloud infrastructure.
- Advance our SRE practice by suggesting ways to improve engineering builds, maintenance, automation, and reliability across the platform using SRE/DevOps tools and Infrastructure-as-Code.
- Develop high-quality pipeline automation workflows, both inside and outside the cloud platform, that align with business and technology strategies.
- Develop and execute test strategies that simulate real-world failure scenarios, including network disruptions, hardware failures, and system overloads.
- Create, script, and run performance tests to measure system behavior under varying levels of load and traffic. Identify bottlenecks, performance degradation, and areas for optimization.
- Design, implement, and maintain automated test suites for infrastructure and application components.
- Ensure that testing is integrated into the CI/CD pipeline to validate system reliability with every release.
- Build automated systems for continuous performance testing, stress testing, and load testing.
- Work closely with SREs, developers, and operations teams to define reliability goals and develop appropriate testing strategies to validate those goals.
- Ensure that new services and features undergo thorough testing for performance, reliability, and failure recovery before deployment to production.
- Validate that monitoring, logging, and alerting mechanisms are functioning correctly by testing systems under failure conditions.
- Ensure that Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are accurately measured and tracked through automated testing frameworks.
Benefits
- 401k Matching
- Retirement Plan
- Generous Paid Time Off