As a Senior Site Reliability Engineer, Observability, you will play a key part in maturing our observability capabilities, standardizing instrumentation, improving telemetry quality, and enabling faster root cause analysis. You will support and evolve end-to-end observability solutions, administer and operate core observability platforms, and drive the adoption of consistent instrumentation practices across services.
Requirements
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 7+ years of experience in Observability, SRE, DevOps, or platform engineering roles supporting production systems.
- Strong understanding of APM and SRE fundamentals, including MELT (Metrics, Events, Logs, Traces), latency analysis, error rate monitoring, service dependency mapping, SLIs/SLOs, alert tuning, and root cause analysis.
- Hands-on experience administering at least one modern observability/APM platform (e.g., Splunk, New Relic, Grafana), with practical exposure to metrics, logs, distributed tracing, and platform configuration.
- Experience building dashboards and actionable alerts, including configuring alert workflows and integrations with incident management tools such as PagerDuty.
- Experience implementing or supporting OpenTelemetry-based instrumentation and improving telemetry quality across services, with a focus on reducing alert fatigue and improving signal-to-noise ratio.
- Familiarity with Kubernetes and cloud-native environments, including an understanding of how applications are deployed, monitored, and scaled, and experience troubleshooting complex production issues in distributed environments.
- Experience managing telemetry pipelines and agents (e.g., collectors, forwarders, sidecars), including onboarding services, troubleshooting ingestion issues, and optimizing pipelines for scale and efficiency.
- Working knowledge of scripting or automation (e.g., Shell, Python) and CI/CD concepts.
- Experience leading or contributing to incident investigations and postmortems, identifying observability gaps and driving continuous improvement.
- Relevant certifications such as New Relic APM Professional, Reliability Engineer – Professional, Splunk Admin, or GCP Associate Cloud Engineer are a plus.
Benefits
- Health & wellness coverage including medical, dental, vision, and mental health resources
- Generous time off including PTO, holidays, a company-wide Priceline Pause reset week, and paid volunteer days
- Work/life support including the ability to work up to 4 weeks per year from anywhere, parental leave, dependent care and family support resources, Summer Fridays, and office perks like stocked kitchens and catered meals (varies by location)
- Financial security programs such as retirement plans with company contributions, life and disability coverage, and tax-advantaged accounts
- Signature travel perks including employee-only discounts on hotels and flights, VIP deals, and Big Deal Bucks credits
- Additional perks & discounts like travel and partner discounts, tuition support, legal support, and pet benefits