RapidSOS is seeking a Senior Site Reliability Engineer to own the performance and stability of services that operate at scale in real-world, high-stakes environments. The ideal candidate will have 5+ years of professional engineering experience with deep expertise in Python, real cloud infrastructure experience with AWS, and hands-on Kubernetes experience with containerized workloads in production.
Requirements
- 5+ years of professional engineering experience with deep expertise in Python
- Real cloud infrastructure experience with AWS: networking, managed databases, cost implications of traffic routing decisions, IAM, DNS-based routing and failover
- Hands-on Kubernetes experience with containerized workloads in production across EKS, ECS, or Fargate
- Strong understanding of distributed systems and how they fail, including resource exhaustion, replication lag, queue backpressure, and other common failure modes
- Experience operating high-throughput messaging systems (RabbitMQ, Kafka, AWS SNS / SQS, etc.) and the infrastructure around them, including infrastructure-as-code (e.g., Terraform) and CI/CD pipelines, with an emphasis on improving reliability and scalability
- Experience building or improving observability through logging, metrics, and alerting
- Demonstrable experience using AI safely and securely to enhance velocity and improve the reliability and recoverability of services
- Strong communication and interpersonal skills; a team player with a positive attitude
- Highly self-motivated, with a strong sense of ownership and the ability to adapt and learn quickly in a fast-paced environment
- Strong proficiency in coding best practices – ability to write clean, maintainable, and testable code
- Demonstrated expertise in problem solving – comfortable working across both infrastructure and application layers to diagnose and resolve issues at the source
- Ability and willingness to collaborate in person a few times per quarter, or as needed
Benefits
- Competitive salary and benefits
- Equity participation