Site Reliability Engineer
Company: Together AI
Location: San Francisco
Posted on: April 2, 2026
Job Description:
About the Role:
As a Site Reliability Engineer (SRE) at Together, you are responsible
for keeping all user-facing services and production systems running
smoothly. You are a blend of pragmatic operator and software engineer
who applies sound engineering principles, operational discipline, and
mature automation to our operating environments and codebase. You
specialize in systems (operating systems, storage subsystems,
networking) while implementing best practices for availability,
reliability, and scalability, with varied interests in algorithms and
distributed systems.

Responsibilities:
- Participate in the on-call rotation (PagerDuty) to respond to
  production incidents
- Build and run our infrastructure with Ansible, Terraform, and
  Kubernetes to enable scaling to a massive number of concurrent users
- Build monitoring systems to ensure the highest quality of service
  for our customers
- Design and implement operational processes (such as deployments and
  upgrades)
- Debug production issues across all services and levels of the stack
- Identify improvements to the product architecture from the
  reliability, performance, and availability perspectives
- Plan the growth of Together AI’s infrastructure

Requirements:
- 5 years of professional SRE or related experience
- Bachelor's degree in Computer Science or a related field, or
  equivalent work experience
- Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
- Proficiency in programming/scripting languages
- Direct experience in monitoring and observability practices
- Knowledge of cloud services
- Ability to thrive in a collaborative environment involving different
  stakeholders and subject matter experts

About Together AI:
Together AI is a research-driven artificial intelligence company. We
believe open and transparent AI systems will drive innovation and
create the best outcomes for society, and together we are on a mission
to significantly lower the cost of modern AI systems by co-designing
software, hardware, algorithms, and models. We have contributed to
leading open-source research, models, and datasets to advance the
frontier of AI, and our team has been behind technological
advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We
invite you to join a passionate group of researchers and engineers on
our journey to build the next-generation AI infrastructure.

Compensation:
We offer competitive compensation, startup equity, health insurance,
and other competitive benefits. The US base salary range for this
full-time position is $150,000 - $200,000, plus equity and benefits.
Our salary ranges are determined by location, level, and role.
Individual compensation will be determined by experience, skills, and
job-related knowledge.

Equal Opportunity:
Together AI is an Equal Opportunity Employer and is proud to offer
equal employment opportunity to everyone regardless of race, color,
ancestry, religion, sex, national origin, sexual orientation, age,
citizenship, marital status, disability, gender identity, veteran
status, and more.

Please see our privacy policy at
https://www.together.ai/privacy
Keywords: Together AI, Pleasanton, Site Reliability Engineer, IT / Software / Systems, San Francisco, California