Senior Site Reliability Engineer
Company: Xops
Location: Pleasanton
Posted on: January 23, 2025
Job Description:
The Senior Site Reliability Engineer (SRE) plays a vital role in
ensuring the reliability, scalability, and performance of our
enterprise software platform. This is a senior-level position that
requires deep technical expertise, strong problem-solving skills,
and the ability to collaborate effectively in a fast-paced,
demanding environment. Our customers, the largest enterprises in
the world, expect 24/7 platform availability and top-tier
performance.
The ideal candidate has strong expertise in AWS cloud
technologies, a deep understanding of serverless architectures (AWS
Lambda), and a passion for building resilient systems to enhance
the customer experience.
Platform Reliability:
- Design, implement, and manage highly available and scalable
systems to meet customer expectations for 24/7 uptime.
- Monitor, troubleshoot, and resolve platform incidents using
tools such as Sentry, New Relic, and custom monitoring
frameworks.
- Lead post-incident reviews to ensure root cause analysis and
preventative measures are in place.
Automation and Optimization:
- Develop and maintain automation for infrastructure management,
monitoring, and incident response.
- Optimize platform performance and scalability, proactively
identifying and addressing bottlenecks.
- Contribute to the development of CI/CD pipelines to improve
deployment reliability and speed.
Collaboration:
- Partner with L2 engineers to resolve complex customer issues,
providing guidance and technical expertise as needed.
- Work closely with product engineering to ensure platform
improvements align with customer needs.
- Actively contribute to the documentation and sharing of best
practices to improve team performance and customer
outcomes.
Leadership:
- Mentor junior engineers and provide technical leadership in
reliability engineering.
- Drive cross-functional initiatives to improve platform
stability and customer satisfaction.
Minimum Requirements:
- Bachelor's degree in Computer Science or related
discipline.
- 8+ years in a Site Reliability Engineering or DevOps role, with
experience supporting enterprise-grade software platforms.
- 3+ years of experience in cloud services, in particular
AWS.
- Experience building observability systems on New Relic,
CloudWatch, or similar.
- Experience implementing rate-limiting, API gateways, and load
balancing for highly available systems.
- Exposure to security best practices and compliance frameworks
(e.g., SOC 2, ISO 27001).
- Proficient in infrastructure as code (IaC) using tools such as
Terraform or CloudFormation.
- Hands-on experience with scripting and programming languages
like Python, Go, or Bash.
- Strong troubleshooting and debugging skills.
- Excellent communication and collaboration skills.
- Experience with incident management and post-mortem
practices.
Soft Skills:
- Exceptional problem-solving and critical thinking
abilities.
- Strong verbal and written communication skills, with the
ability to navigate ambiguity and provide clarity.
- Ability to work collaboratively in cross-functional teams under
pressure.
Key Attributes:
- Reliability-Driven: Strong commitment to platform reliability
and performance.
- Leadership and Mentorship: Willingness to guide and mentor less
experienced team members.
- Customer-Focused: Dedication to meeting and exceeding customer
expectations in a high-pressure environment.
Expectations:
- Availability to participate in a 24/7 on-call rotation.
- Ability to work in a fast-paced, ambiguous environment with
rapidly changing priorities.
- Proactive approach to identifying and mitigating risks before
they impact customers.
- Strong sense of accountability and ownership for platform
stability and customer satisfaction.
Keywords: Xops, Pleasanton, Senior Site Reliability Engineer, Engineering, Pleasanton, California